distributed systems toolkit

1. List of Documents

The available documentation is as follows:

Addressing - how processes are addressed in ISIS.
Auther - a tool used by processes to protect against unauthorized requests.
Bboards - a very high level "bulletin board" facility.
Bcast - how to do broadcasts in ISIS.
Bitvec - how bit vectors are used in the site-views data structure.
Config - configuration data structure manager.
Compile - how to compile ISIS client programs under UNIX.
Coord - coordinator-cohort routines.
Entries - how to define the entry points to a program.
Files - files the system uses, and how to start the system up.
Filter - a technical discussion of message filters, not for novices.
Init - how to call the isis_init() routine.
Logs - some suggestions on how to obtain "recoverable" replicated data.
Messages - all you need to know about messages in ISIS.
Msg - the actual message editing routines, summarized.
News - a bulletin board facility.
Pgroup - all about process groups and process group views.
Protection - overview of protection features in ISIS.
Recovery - a short example on using the recovery manager.
Replication - a general purpose replicated data management tool.
Rexec - remote execution facility.
Rmgr - recovery manager (restarts you after a crash).
Rmupdate - utility for use with the recovery manager.
Sema - semaphores for synchronization.
Startup - a long example on starting up an ISIS program.
State_xfer - a state transfer utility, very useful.
Sview - site views.
Tasks - a lightweight task mechanism that you currently have to use.
Transactions - a transaction facility.
Vsync - a discussion of virtual synchrony and how to exploit it.
Watch - a facility for watching for a desired (or feared) event.

2. Design philosophy

Although these documents are presented in alphabetical order, there is a good order in which
to read them. Before starting, we recommend that you learn a little about ISIS. Some of the
recent papers would be fine, or one of the short overviews we have generated over the past few
years. Once you think you have the picture, start by reading about addressing, messages, tasks,
process groups, and entries. Then read about compiling, system files, the init routine, and the
thing called "startup", which describes a long example. Documentation on the various toolkit
routines can be read as needed. The most useful ones are probably rexec, rmgr, state_xfer,
coord-cohort, sema, and update. Most toolkit mechanisms are orthogonal to the others, although
this may be hard to see until you get some experience using the system. Also, some work will be
needed on parts of the system concerned with recovery before this can be done transparently.
A good development strategy is to start with the system design for the "normal" case, then
add code for recovery from failures and dynamic reconfiguration - it won't change your original
code much at all. The evolution of the twenty questions program is illustrative of this design
philosophy.
[Figure: system components, including the group view database, the recovery actions database,
RMGR, REXEC, and the site view.]
1. Synopsis

A discussion of addressing in the ISIS system.

2. Interface

#include <isis/cl.h>

3. Type definitions

From the bottom up, ISIS knows about sites, site lists and views, process addresses, group
addresses, and group views. There is also a notion of remote sites, processes, and groups, which
can only be accessed using special protocols.

1. Currently, ISIS only supports local sites. A local site is identified by a two-byte sequence
consisting of a site-number, in the range 1-127, and a site-incarnation, also in the range
1-127. This sequence is referred to as a site-id in the ISIS system. In future versions of the
system, site-numbers will be expanded to include a concept of local sites, long-distance sites,
and remote sites. In this extension, site-id's will be 4 bytes long. Two bytes will represent
the cluster number, one byte the site-number, and one byte the site incarnation number.
Local sites will have a cluster number of 0. Long-distance sites are intended to represent
sites within the same geographical area but accessible at somewhat higher cost, e.g. through
a gateway. The cluster number for these sites will be in the range 0-127. Remote sites are
assumed to be accessible only over genuinely long-distance connections and will use a
hierarchical numbering scheme. A remote site number will be represented by a cluster
number in the range 128-255 and must be mapped through a mount table to obtain remote
addressing information, using a method that is at present unspecified. In most cases special
protocols will be used to communicate with remote sites. Process groups will not be allowed
to cross long-distance communication boundaries, but mechanisms for uniting copies of a group
that lives on both sides of such a boundary will be provided.

The number of sites in a cluster is intentionally kept small to control the costs of the ISIS
protocols. The actual decomposition of sites into clusters is transparent to the ISIS user, but
can affect performance: whenever possible, processes that interact heavily with one another
should be located within the same cluster. In addition, it is undesirable for clusters to
"partition" in such a manner that communication between two subgroups of the cluster is
temporarily impossible. For example, if a cluster contains a single gateway, ISIS may block
(hang) during periods when the gateway is down. This problem can be circumvented by
introducing redundant communication gateways whenever possible.

Several pre-defined macros allow one to extract the fields from a site_id sid: SITE_NO(sid),
SITE_INCARN(sid) and SITE_CLUSTER(sid). The macro MAKE_SITE_ID(site-no, incarn) can be
used to create a local site-id.

2. A site-list is a list of site-id's terminated by a null site-id. Note that SITE_NO(sid) is null
only for a null site-id. This is useful when scanning the elements of a site-list.

3. A site view consists of a list of site-id's and associated information maintained by the system
failure detection module for a single ISIS cluster. In particular, a view has a view-id
number, and all sites in a cluster observe the same sequence of views. It can be assumed
that the sites in a site-view are listed in order of age (oldest first) and that all observers see
the same sequence of site-views. See SVIEW(TK) and VSYNC(TK) for details.

4. A process address in ISIS consists of a site-id, a type field containing the constant ISAPID, a
unique process-id number which is a short integer used by the operating system at that site
to identify a process running on its site, and an entry point within that process, which may
be null. The procedure MAKE_ADDRESS(site, incarn, pid, entry) can be used to make a
process address. The corresponding field names are site, incarn, process, entry. The site
field will never be null, hence this is sometimes used to detect the end of a list of addresses.
Certain predefined id-numbers are used to identify system processes. For example, the
defined symbols PROTOCOLS, REXEC, RMGR, and NEWS are automatically mapped to
the process-id for the corresponding service at a given destination site. Additional
system-wide process numbers will be added as the ISIS system evolves. These addresses are
defined in generic_address.h.

The function cmp_address() is provided to facilitate address comparisons. Invoked as
cmp_address(a1,a2), where a1 and a2 are pointers to addresses, this returns 0 if a1 and a2
are the same, a negative number if a1 and a2 differ and a1 is "smaller", and a positive
number if a1 is larger. Thus, although most users would just compare the result with 0,
cmp_address() is compatible with the standard UNIX quicksort utility, qsort(). The caller of
cmp_address() should be aware that an address with the entry field specified as 0 is treated
specially: such an entry is a wild-card that will match any other entry value. Two addresses
with non-zero entry numbers must match exactly, however.

5. A group address is an address used to identify a process group in ISIS. Such an address
consists of a site-id for the site, a type field containing the constant ISAGID, a 16-bit group id,
and an entry point that must be the same for all members of the group. Notice that group and
process addresses both have the same format; if desired, a process address may be thought
of as a group containing one member. Group addresses are created using the pg_create()
request, but because of subsequent join, leave and failure events the group may subsequently
migrate to other sites in the system. Consequently, group addresses are usually obtained
using pg_lookup(). This implies that the site-id in the group address is not necessarily useful
for determining where members of the group reside (but see also PGROUPS(TK)).
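
The site-id and address conventions in items 1 to 5 can be illustrated with a small sketch.
Everything prefixed sk_ or SK_ is hypothetical: the byte layout of the site-id, the struct
layout, and the order in which fields are compared are assumptions made for illustration,
not the definitions found in the ISIS header files.

```c
#include <stddef.h>

/* Hypothetical two-byte local site-id: site number in the high byte,
 * incarnation in the low byte. The real layout may differ. */
typedef unsigned short sk_site_id;

#define SK_MAKE_SITE_ID(no, inc) \
    ((sk_site_id)((((no) & 0x7f) << 8) | ((inc) & 0x7f)))
#define SK_SITE_NO(sid)     (((sid) >> 8) & 0x7f)
#define SK_SITE_INCARN(sid) ((sid) & 0x7f)

/* Count the entries of a site-list; a site number of 0 marks the null site-id. */
static size_t sk_site_list_len(const sk_site_id *list)
{
    size_t n = 0;
    while (SK_SITE_NO(list[n]) != 0)
        n++;
    return n;
}

/* Hypothetical stand-in for a process address. */
typedef struct {
    unsigned char site;    /* never 0 in a valid address */
    unsigned char incarn;
    short process;
    unsigned char entry;   /* 0 acts as a wild card */
} sk_address;

/* Three-way comparison in the spirit of cmp_address(): 0 if equal,
 * negative if a1 is "smaller", positive if a1 is larger. */
static int sk_cmp_address(const sk_address *a1, const sk_address *a2)
{
    if (a1->site != a2->site)
        return (int)a1->site - (int)a2->site;
    if (a1->incarn != a2->incarn)
        return (int)a1->incarn - (int)a2->incarn;
    if (a1->process != a2->process)
        return a1->process < a2->process ? -1 : 1;
    if (a1->entry == 0 || a2->entry == 0)
        return 0;                       /* wild-card entry matches anything */
    return (int)a1->entry - (int)a2->entry;
}
```

Note that the wild-card rule means two addresses that each match a third need not match each
other, so a sort that relies on this comparison should probably use addresses whose entry
fields are all non-zero.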

4. Entry points

Each process in the ISIS system is understood to accept messages at a variety of entry points. An
entry point is a one-byte unsigned integer. Some entry points have standard values:
GENERIC_RCV_REPLY is the entry point to which a reply message can be sent,
GENERIC_TK_CHKPT is the entry point used by the checkpoint toolkit routine to trigger a
checkpoint, etc. These are defined in generic_address.h, which is automatically included when
"cl.h" is included into a program. In addition, each process can define additional entry points of
its own. To avoid accidental conflict with these generic addresses, these user-defined entry points
should be assigned entry numbers greater than or equal to USER_BASE, a constant also defined
in that file. Notice that different processes can interpret the same entry "number" in different
ways.

A process declares its entry points by calling the routine

isis_entry(entry-point-number, routine, "printable name");

Many of the toolkit routines install their own handlers (for the GENERIC entries) when isis_init()
is called. On arrival of a message, the corresponding entry will be invoked as:

routine(mp)
message *mp;
{
}

The message is automatically deleted after the routine terminates, unless msg_increfcount() has
been called prior to returning. If a message arrives in a process and the process has not specified
a routine to handle messages to the specified entry point, the message is discarded and an error
message is printed on the stderr output channel.
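
The dispatch behavior described above can be sketched as a table indexed by entry number.
This is only an illustration under assumptions: the names sk_entry, sk_dispatch and
sk_message are hypothetical, and the real isis_entry() bookkeeping is not documented here.

```c
#include <stdio.h>

#define SK_MAX_ENTRIES 256             /* entry points are one-byte unsigned integers */

typedef struct sk_message sk_message;  /* opaque, hypothetical stand-in for message */
typedef void (*sk_handler)(sk_message *mp);

static sk_handler sk_handlers[SK_MAX_ENTRIES];
static const char *sk_names[SK_MAX_ENTRIES];

/* Record a handler and a printable name for one entry point. */
static void sk_entry(unsigned char entry, sk_handler routine, const char *name)
{
    sk_handlers[entry] = routine;
    sk_names[entry] = name;
}

/* Deliver mp to the handler registered for entry.
 * Returns 0 on delivery, -1 if the message was discarded. */
static int sk_dispatch(unsigned char entry, sk_message *mp)
{
    if (sk_handlers[entry] == NULL) {
        fprintf(stderr, "no handler for entry %u, message discarded\n",
                (unsigned)entry);
        return -1;
    }
    sk_handlers[entry](mp);
    return 0;
}

/* Sample handler used to exercise the table: it only counts invocations. */
static int sk_calls;
static void sk_count_handler(sk_message *mp)
{
    (void)mp;
    sk_calls++;
}
```

A caller would register sk_count_handler at some entry number of its own choosing and then
pass arriving messages to sk_dispatch(); messages for any other entry are reported on stderr
and dropped, mirroring the rule stated in the text.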

See PGROUP(TK) for information on manipulating process groups, ENTRY(TK) for more
information on entry points, MESSAGES(TK) for more information on messages, and BCAST(TK)
for information on sending messages to the members of one or more process groups.

5. RPC style interactions

An RPC style interaction occurs when a process sends a message to another process and awaits a
reply. ISIS supports this mode of interaction, and will even provide stub generators to compile
from a "nice" looking RPC syntax into the message generation and unparsing mechanisms needed
to map this into the above facility. To identify the RPC "session", a session-id number is placed
in the message at the time it is sent (see MESSAGES(TK)). The sending task then blocks
awaiting a reply with this session-id number; session-id numbers are 32-bit integers and should
not be re-used. Thus, a pending RPC has an address consisting of the address of the caller process
together with the id of the session. To send a reply, the replying task creates a message
containing the reply value (field name FLD_ANSW), the length of this field (FLD_ALEN) and the
session-id number identifying the session (FLD_SESSION), and then transmits this message to the
sender of the RPC. In general, ISIS does not assume that it is an error to send the same reply
more than once or to send multiple replies to a task that expected just one reply. In these cases,
the superfluous replies are discarded silently.
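
The session matching rule can be sketched as follows. The structure and function names are
hypothetical, and the sketch makes a simplifying assumption: it counts replies per session and
does not try to tell which sender a duplicate came from.

```c
#include <stdint.h>

/* Hypothetical record of one pending RPC. */
typedef struct {
    uint32_t session;   /* id placed in the outgoing request */
    int wanted;         /* replies the caller is waiting for */
    int received;       /* replies accepted so far */
} sk_pending_rpc;

static uint32_t sk_next_session = 1;

/* Start a call: hand out a session-id that is not re-used
 * (until the 32-bit counter wraps). */
static sk_pending_rpc sk_rpc_begin(int wanted)
{
    sk_pending_rpc p;
    p.session = sk_next_session++;
    p.wanted = wanted;
    p.received = 0;
    return p;
}

/* Offer a reply carrying reply_session to a pending call.
 * Returns 1 if accepted, 0 if it is silently discarded. */
static int sk_rpc_accept(sk_pending_rpc *p, uint32_t reply_session)
{
    if (reply_session != p->session)
        return 0;                   /* belongs to some other session */
    if (p->received >= p->wanted)
        return 0;                   /* superfluous reply */
    p->received++;
    return 1;
}
```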

6. Printing an address

The routine paddr(addr) will print the address pointed to by addr; paddrs(alist) will print the
members of a null-terminated address list, and psite(sid) will print the site number and incarnation
for a site-id. Whenever possible, entry numbers are printed in their text form, but if paddr() is
called in a place that just doesn't know the text form for an entry point, the numeric version is
printed instead. This is true for process-id's too.

7. Site names

The array site_names[] gives, for a site-id, the printable name of that site. These names are
actually taken from a file used during startup of the system (see FILES(TK), STARTUP(TK)).

1. Synopsis

A mechanism for restricting access to a group.

2. Interface

#include <isis/cl.h>

au_request_verify(routine)
int (*routine)();

au_permit(who)
address who;

au_revoke_perm(who)
address who;


Normally, any process in possession of the address of a group can issue calls to that group. A few
applications will need more protection than this, however, and the authentication tool gives them
that option.

To enable the tool, call au_request_verify(routine), giving a routine that will verify the legality of
requests from unknown callers. The routine is invoked as:

routine(mp)
message *mp;
{
}

and should return 0 if the request is legal. A reply(mp,0,0,0) is sent by the authentication service
if the routine returns -1. No reply is sent if the routine returns some other value. To avoid
unnecessary work, the routine au_permit() can be called to indicate that the designated caller is
permitted to send arbitrary requests to this process. All messages from that process will be
allowed through. au_revoke_perm() removes an address from the privileged caller list; subsequent
messages from that process will be passed through the verification procedure.
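
The decision made for each incoming request can be sketched as below. This is a sketch under
assumptions, not the authentication tool itself: callers are modelled as plain integers rather
than addresses, the verifier receives the caller rather than a message pointer, and all sk_
names are hypothetical.

```c
#include <stddef.h>

#define SK_MAX_PERMITTED 16

typedef int sk_caller;                       /* stand-in for an address */
typedef int (*sk_verifier)(sk_caller who);   /* the real routine receives a message */

enum sk_action { SK_DELIVER, SK_NULL_REPLY, SK_DROP };

static sk_caller sk_permitted[SK_MAX_PERMITTED];
static size_t sk_npermitted;

static int sk_is_permitted(sk_caller who)
{
    size_t i;
    for (i = 0; i < sk_npermitted; i++)
        if (sk_permitted[i] == who)
            return 1;
    return 0;
}

/* Add a caller to the privileged list (ignored if the list is full). */
static void sk_permit(sk_caller who)
{
    if (!sk_is_permitted(who) && sk_npermitted < SK_MAX_PERMITTED)
        sk_permitted[sk_npermitted++] = who;
}

/* Remove a caller from the privileged list, if present. */
static void sk_revoke(sk_caller who)
{
    size_t i;
    for (i = 0; i < sk_npermitted; i++) {
        if (sk_permitted[i] == who) {
            sk_permitted[i] = sk_permitted[--sk_npermitted];
            return;
        }
    }
}

/* Privileged callers bypass the verifier. Otherwise 0 means deliver,
 * -1 means answer with a null reply, anything else means drop silently. */
static enum sk_action sk_filter(sk_caller who, sk_verifier verify)
{
    int rc;
    if (sk_is_permitted(who))
        return SK_DELIVER;
    rc = verify(who);
    if (rc == 0)
        return SK_DELIVER;
    return rc == -1 ? SK_NULL_REPLY : SK_DROP;
}

/* Sample verifier: an arbitrary rule used only to exercise the filter. */
static int sk_sample_verify(sk_caller who)
{
    if (who < 100)
        return 0;
    return who < 200 ? -1 : 1;
}
```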

The verification procedure is permitted to call t_fork_delayed, t_fork_urgent, t_sig_delayed and
t_sig_urgent, but may not call t_wait or try to do an RPC or broadcast. This is because it is run
from the main thread of control - not as a task.

1. Synopsis

A package of routines implementing distributed bulletin boards as described in our technical
report. These routines will be implemented during the summer or fall of 1987. The interface will
be a subroutine one but otherwise very similar to the one discussed in the paper. Initially, only C
will have access to the bboard facility, but versions for other languages (especially LISP) will be
provided eventually.

1. Synopsis

A package of routines implementing distributed broadcasts of various flavors, and with a variety
of destination addressing modes.

2. Interface

#include <isis/cl.h>

isis_init(0);

/* Broadcast to a list of addresses */
nresp = BCAST(alist, msg, nwanted, answ, atype, alen, rlist)
address *alist, *rlist;
message *msg;
char *answ;

/* Broadcast to everyone on a list except the sender */
nresp = BCAST_EX(alist, msg, nwanted, answ, atype, alen, rlist)
address *alist, *rlist;
message *msg;
char *answ;

/* Reply to a broadcast or RPC request */
reply(msg, value, type, len)
message *msg;
char *value;

/* Reply, sending a copy to other processes */
reply_cc(msg, alist, value, type, len)
message *msg;
address *alist;
char *value, *fvalue;

/* Flush any asynchronous messages */
FLUSH()

Above, BCAST is an unordered (but reliable) protocol, and is not actually used very often in ISIS.
You can substitute CBCAST to obtain the causal broadcast, ABCAST for the atomic broadcast,
and GBCAST for the strongly ordered group broadcast protocol. Section 4, below, discusses the
way this choice would normally be made.

3. Discussion

In each case, the addressing information is used to determine a set of destination processes (an
address list) to which the message is delivered. On reception of a message, this information will
be present in its dests field (see msg_getdests() in MSG_EDIT(TK)). The protocol waits until
nwanted replies are collected, or until it has as many replies as possible, and then returns the
number of replies and a vector containing the replies themselves. The address of the sender who
supplied the i'th answer will be saved in rlist[i] if rlist is non-null, and is discarded otherwise.

If nwanted is 0, the message will be sent asynchronously. That is, the caller can continue
executing before the message is delivered, although there will be a delay even in this case while
the message is passed to the ISIS protocols process. A message is said to be synchronous if the
caller blocks waiting for one or more replies, although obviously there is a range of degrees of
synchrony: a caller that waits for one reply will be running much more asynchronously than one
that waits for replies from all destinations. Asynchronous execution is always much faster than
synchronous execution; the more synchronous, the higher the performance cost.

A reply is specified as a pointer to the data in question (it will be copied to a safe place during the
call), the type (see MESSAGES(TK)), and the length of the data item in bytes. Replies to a
message are sent using reply() or reply_cc() (which sends copies of the reply to some other set of
processes). If the caller has specified ALL for nwanted it is still possible for a recipient of the
message to refuse to reply; this is done by calling reply() with a null answ pointer and a 0 alen.
The reply is also permitted to be shorter than the length specified in the broadcast.

For example, the fancy twenty-questions program described in the XFER(TK) documentation uses
a reply(mp,(char*)0,0,0) when one of its hot standby processes gets a request message.
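
The rule for when a broadcast returns to its caller can be sketched as a small predicate. This
is only an illustration: the encoding of ALL as -1, the structure fields, and the decision to
count failed destinations together with null replies are all assumptions, not taken from the
ISIS sources.

```c
#define SK_ALL (-1)   /* hypothetical encoding of ALL */

/* Hypothetical bookkeeping for one outstanding broadcast. */
typedef struct {
    int ndests;     /* destinations the message was delivered to */
    int nwanted;    /* replies requested by the caller, or SK_ALL */
    int nreplies;   /* real replies collected so far */
    int nrefused;   /* null replies plus destinations observed to fail */
} sk_collect;

/* Returns 1 when the broadcast may return to its caller: either enough
 * replies have been collected, or no further replies can arrive. */
static int sk_collect_done(const sk_collect *c)
{
    int target = c->nwanted;
    if (target == SK_ALL || target > c->ndests)
        target = c->ndests;
    if (c->nreplies >= target)
        return 1;
    return c->nreplies + c->nrefused >= c->ndests;
}
```

With nwanted equal to 0 the predicate is true immediately, which corresponds to the
asynchronous case described earlier.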

The addressing rules used by the broadcasts are relatively subtle. The basic idea is this:

a) BCAST() sends to the processes and process group members listed in the null-terminated
address list. It is assumed that the "entry" field of each address in the list has been set to
the entry number to which the message should be delivered (see ADDRESSING(TK)). If
this is a standard entry and hence the entry number will always be the same, the routine
set_entry(alist, value) can be used to set all entry numbers in the alist to the designated value
(for convenience, set_entry returns its alist argument).

b) BCAST_EX() is like BCAST(), except that if the sender is a member of the address list it
will be excluded from the actual delivery. This is useful when an asynchronous broadcast is
to be sent to the remote managers for some distributed resource after the local copy has
already been updated.
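
The two list manipulations in a) and b) can be sketched as follows. The sk_addr type and both
functions are hypothetical; in particular the sketch handles only process addresses and does
not expand process groups into their members.

```c
#include <stddef.h>

/* Hypothetical stand-in for an address; site 0 marks the end of a list. */
typedef struct {
    unsigned char site;
    short process;
    unsigned char entry;
} sk_addr;

/* Set the entry field of every address in a null-terminated list.
 * Returns the list itself, as set_entry is described as doing. */
static sk_addr *sk_set_entry(sk_addr *alist, unsigned char entry)
{
    sk_addr *p;
    for (p = alist; p->site != 0; p++)
        p->entry = entry;
    return alist;
}

/* Copy alist to out, leaving out the sender. out must have room for the
 * whole list including its terminator. Returns the number of destinations. */
static size_t sk_exclude_sender(const sk_addr *alist, const sk_addr *self,
                                sk_addr *out)
{
    size_t n = 0;
    for (; alist->site != 0; alist++) {
        if (alist->site == self->site && alist->process == self->process)
            continue;
        out[n++] = *alist;
    }
    out[n].site = 0;
    out[n].process = 0;
    out[n].entry = 0;
    return n;
}
```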

What makes addressing complicated is that ISIS makes a distinction between process groups that
are directly accessible by a process and those that it can only access indirectly. A group is directly
accessible by any of its members, plus any additional processes that a group member has added to
the group view using pg_addclient() (see also PGROUPS(TK)). Alists as described above can
only be used if all the process groups in the alist are directly accessible (other processes may be
explicitly listed too). If a message is to be sent to a process group that is not directly accessible, the
alist must only contain one entry - the group address. Thus, broadcast addressing is far more
flexible in the case of directly accessible addresses. To make matters worse, CBCAST() doesn't work
correctly if invoked asynchronously from a process that can only access the destination indirectly.
Thus, in the case of indirect access, CBCAST should only be used synchronously (waiting for
responses from one or more destinations). This limitation will be eliminated in a future release of
ISIS.

If a broadcast is invoked with nwanted equal to 0, or if several broadcasts are done concurrently
by different tasks within a single process, the issue arises of how to ensure that they have
terminated before taking some action that might leave an externally visible trace. Otherwise, should
a failure cause one of these protocols to abort, the external state of the system might be
inconsistent with the state left by the failure. The FLUSH() primitive should be invoked for this
purpose. It blocks until all pending broadcasts are completed and then permits the caller to resume
computation normally.

4. Picking the right flavor of broadcast

In most cases, CBCAST should be specified as the broadcast primitive; this is the cheapest
protocol in ISIS and it is highly advantageous to use it whenever possible. However, some replicated
data structures and algorithms need the stronger ordering that ABCAST and GBCAST provide,
and there is no very simple way to explain how one identifies these applications. The basic rule is:
CBCAST is used when messages from other processes that happen to arrive at the same time as
your broadcast will be serviced the same way regardless of whether your message arrives first or
second. For example, requests to read a replicated database or for some other simple service
would be transmitted using CBCAST: these have no effect at all, and databases are usually locked
to prevent reads while they are being updated. ABCAST is used when requests will be queued or
otherwise applied to a replicated data structure that would return different results after a series
of updates depending on the order in which they were done. A FIFO queue has this behavior, but
a B-tree or a file normally would not. Thus, one would normally use ABCAST when talking to a
FIFO queue and CBCAST when talking to a B-tree manager. GBCAST is used to obtain a
consistent cut across the system, an operation that is only needed in certain highly sophisticated
algorithms. If you are concerned that your application may have an order sensitivity, it should still
suffice for you to use ABCAST. ABCAST, however, is slower than CBCAST. More discussion
of this choice appears in VSYNC(TK).
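
The notion of order sensitivity can be made concrete with a small sketch that applies the same
two updates in both orders and compares the results. The names are hypothetical, and a bit-mask
set is used here as a simple stand-in for an order-insensitive structure such as the B-tree
mentioned above.

```c
#define SK_QCAP 8

/* A tiny FIFO queue: its contents depend on the order of insertions. */
typedef struct {
    int items[SK_QCAP];
    int n;
} sk_fifo;

static void sk_fifo_put(sk_fifo *q, int v)
{
    if (q->n < SK_QCAP)
        q->items[q->n++] = v;
}

/* A set of small integers (0..31) kept as a bit mask: insertion order
 * leaves no trace in the final state. */
typedef unsigned long sk_set;

static void sk_set_add(sk_set *s, int v)
{
    *s |= 1UL << v;
}

/* Apply updates a and b to two queue replicas in opposite orders.
 * Returns 1 if both replicas end up identical, 0 otherwise. */
static int sk_fifo_order_insensitive(int a, int b)
{
    sk_fifo x = {{0}, 0}, y = {{0}, 0};
    int i;
    sk_fifo_put(&x, a); sk_fifo_put(&x, b);
    sk_fifo_put(&y, b); sk_fifo_put(&y, a);
    for (i = 0; i < x.n; i++)
        if (x.items[i] != y.items[i])
            return 0;
    return 1;
}

/* The same experiment for the set replicas. */
static int sk_set_order_insensitive(int a, int b)
{
    sk_set x = 0, y = 0;
    sk_set_add(&x, a); sk_set_add(&x, b);
    sk_set_add(&y, b); sk_set_add(&y, a);
    return x == y;
}
```

Two queue replicas that receive distinct updates in different orders diverge, which is the
situation ABCAST is meant to prevent; the set replicas agree regardless of order, so the
cheaper CBCAST would suffice for them.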

5. RPC mechanism

ISIS does not currently support the sort of argument packaging that is common in RPC services
such as the SUN RPC service or the CEDAR-MESA one. However, it will shortly. In the
meantime, to do an RPC, the arguments should be packaged in a message, and then one of the
broadcasts used to transmit this message to its destination (that is, the alist should specify a single
destination). Set nwanted to 1 and wait for the answer. It is easy to generalize this to a group-RPC
(set nwanted to ALL) or a quorum-oriented RPC (set nwanted to the quorum size). The RPC will
return when nwanted responses are received and any extra responses will be discarded. Within
ISIS itself, we use a combination of these methods.

6. What can you conclude about a failure?

If a broadcast routine is asked to wait for a reply from some process, but it returns without that
reply, the process has failed or simply doesn't exist. You can actually conclude a bit more about
the system state than this, however, and understanding what you can assume will simplify your
code.

First, it is safe to assume that you won't see any more actions or messages initiated by the failed
process. For example, if it was supposed to send a reply (or a reply_cc), either the reply gets
through first, to all destinations in the reply_cc case, or the reply just won't arrive anywhere, ever.
This is also true when the process might have been using a toolkit routine at the time it crashed.
For example, if a process may have failed while pg_addmemb() was running, either the new
member will have been added before the failure is detected, or the failure will be detected first, in
which case the pg_addmemb() will not take place - this is because the addmemb() algorithm in
ISIS is based on GBCAST, and either the GBCAST is delivered before the failure is announced,
or not at all.

Thus, if you didn't get some message and the process that is supposed to send it is observed to
fail, you won't get it - and neither will anyone else who would have received a copy. This is a
property called broadcast atomicity.

Also, the failed process will vanish from any process group views in which it was listed, and if the
broadcast used one of these groups as a destination, the failed process drops out before the
broadcast returns to its caller. The process may not, however, have been dropped from other groups to
which the message was not sent. This is because there could be a delay between when one view
gets changed and when another does.

For example, assume that process 'p' is a member of groups g1 and g2. Some other process, 'q',
broadcasts to g1 and waits for replies from all its destinations. If 'p' fails, process 'q' will
definitely find that 'p' has dropped from group g1. On the other hand, 'p' may still be listed as a
member of group g2, which was not a destination of the broadcast.

1. Synopsis

Some simple routines for manipulating vectors of bits.

2. Interface

#include "cl.h"

#define MAXBITS 32 /* Multiple of 32 */

bis(vec, b)
bitvec vec;
int b;

bic(vec, b)
bitvec vec;
int b;

bit(vec, b)
bitvec vec;
int b;

bisv(vec1, vec2)
bitvec vec1, vec2;

bicv(vec1, vec2)
bitvec vec1, vec2;

bandv(vec1, vec2)
bitvec vec1, vec2;

bitv(vec1, vec2)
bitvec vec1, vec2;

bclr(vec)
bitvec vec;

bset(vec)
bitvec vec;

These routines support vectors of bits of arbitrary length and are used by ISIS to implement the
sv_failed and sv_recovered parts of a site-view. They implement 32-bit vectors as long integers
and longer vectors as character arrays. They can be used for other purposes, but you may be
forced to use the same value for MAXBITS as the remainder of the system if you include cl.h in
the file that employs the bitvectors. The routines will set/clear/test a single bit, set/clear/and/test bit
by bit between two vectors, clear an entire vector (set bits to 0), or set an entire vector (set its bits
to 1).
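
The operations listed above can be sketched over a character array. This is an illustration
only: the sk_ names are hypothetical, the vector length is fixed at 64 bits for the example,
and the assumption that the two-vector operations modify their first argument in place is
ours, not something the text states.

```c
#include <string.h>

#define SK_MAXBITS 64                  /* a multiple of 32, as MAXBITS must be */
#define SK_NBYTES  (SK_MAXBITS / 8)

typedef unsigned char sk_bitvec[SK_NBYTES];

/* Single-bit operations: set, clear, test. */
static void sk_bis(sk_bitvec v, int b) { v[b >> 3] |= (unsigned char)(1u << (b & 7)); }
static void sk_bic(sk_bitvec v, int b) { v[b >> 3] &= (unsigned char)~(1u << (b & 7)); }
static int  sk_bit(const sk_bitvec v, int b) { return (v[b >> 3] >> (b & 7)) & 1; }

/* Vector operations, applied bit by bit to v1 using v2. */
static void sk_bisv(sk_bitvec v1, const sk_bitvec v2)
{
    int i;
    for (i = 0; i < SK_NBYTES; i++)
        v1[i] |= v2[i];
}

static void sk_bicv(sk_bitvec v1, const sk_bitvec v2)
{
    int i;
    for (i = 0; i < SK_NBYTES; i++)
        v1[i] &= (unsigned char)~v2[i];
}

static void sk_bandv(sk_bitvec v1, const sk_bitvec v2)
{
    int i;
    for (i = 0; i < SK_NBYTES; i++)
        v1[i] &= v2[i];
}

/* Test: returns 1 if v1 and v2 have any bit in common. */
static int sk_bitv(const sk_bitvec v1, const sk_bitvec v2)
{
    int i;
    for (i = 0; i < SK_NBYTES; i++)
        if (v1[i] & v2[i])
            return 1;
    return 0;
}

/* Whole-vector clear and set. */
static void sk_bclr(sk_bitvec v) { memset(v, 0, SK_NBYTES); }
static void sk_bset(sk_bitvec v) { memset(v, 0xff, SK_NBYTES); }
```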
How to compile an ISIS client program.

ln -s ~isis/client/lib1.a l1
ln -s ~isis/client/lib2.a l2
ln -s ~isis/msg_edit/mlib.a l3
cc -c -I~isis/client client_prog.c
cc -o client_prog client_prog.o l?


3. Warnings

ISIS uses a number of global variables, and it is obviously a bad idea to re-use the same variable
names for some other purpose. We try to use names like cl_... or pr_... to avoid likely conflicts,
and to declare our variables to be static whenever possible, but some care is certainly required.
Many of these global variables are declared in cl.h. Eventually, we plan to clean this up and will
also provide a list of global variables and what they are used for below.

Shortly, the library called mlib.a will be merged with lib2.a. The use of two libraries is an
unavoidable consequence of the way RANLIB works. The first (lib1.a) contains toolkit routines, and
the second contains the remainder of the client->isis interface code.


3.1.  Toporary  SUN  veniaa 

We  have  some  ideas  on  how  to  reorganize  the  system,  but  for  now  the  various  libraries  come  in 
two  versions.  The  one  shown  above  is  for  the  gould.  Ota  the  SUN,  everything  is  the  same  except 
for  the  dient  directory,  whidi  is  renamed  "klient",  and  the  message  edft  library,  which  is  renamed 
sunjnUb.a.  This  situation  wiD  go  away  very  soon. 
1. Synopsis

A toolkit routine for managing configuration information

#include <isis/cl.h>

config_update(gid, name1, data1, type1, len1, name2, ... , 0);
address gid;
char *name1, *data1;
int type1, len1;

char *config_get(gid, name)
address gid;
char *name;


Some applications will need to divide up tasks using application specific rules that change
dynamically. The configuration tool makes this easy, requiring only that the configuration updates be
done by members of the group to which the configuration applies, not outside “clients”. The rule
should be represented using one or more data structures; multiple structures would be used in
some cases because of the need to specify "type" information using the mechanism supplied by
the message editing routines (see MESSAGES(TK)). To update the configuration structure, use
config_update. When a message arrives, all recipients should call config_get to query the structure
and all will see the same values in it for any given message. You should copy these values to the
side if you plan to look at them after doing something that could block (an RPC or broadcast, a
t_wait(), etc); these values can change while a task is asleep. Configurations are maintained on a
per-group basis; the same field may have different values and meanings for two different groups
even if the same program belongs to both groups.

It is costly to update configurations, especially if the same configuration is updated concurrently by
multiple group members. Therefore, such behavior must be avoided whenever possible.
1. Synopsis

A routine implementing coordinator-cohort computations.

2. Interface

#include <isis/cl.h>

isis_init(0);


/* Run a coordinator-cohort computation */
coord_cohort(msg, gid, alist, action, rtype, rlen, got_reply)
message *msg;
address gid, *alist;
char *(*action)();
int (*got_reply)();

/* Figure out who the coordinator should be */
address coordinator(gid, sender, alist)
address gid, sender, *alist;

Many ISIS algorithms are based on the idea of having one process (the coordinator) take some
action while others (its cohorts) monitor it and take over in the event of failure. Although this
can be implemented several ways, we picked a simple scheme and provided it in the
coordinator-cohort toolkit facility. Since the notion of picking a coordinator for a task is somewhat more
general than the notion of running a coordinator-cohort computation, the routine we use to pick the
coordinator is also documented here.

A coordinator-cohort interaction starts when a client issues a request to some group of processes
using a broadcast. The client typically blocks waiting for a single reply, which may come from any
of the destination processes. The recipients of this message all invoke coord_cohort with the
following arguments: the message, the address of a routine that will take the desired action, the
processes at which the computation is running, and the length and type of the result (see
MESSAGES(TK) and MSG_EDIT(TK)). The list of processes should be in the same order at
every process that invokes the coord-cohort routine; this is facilitated by the fact that the message
destinations (see msg_getdests()), the members of a process group (pg_getview(gid)->pg_alist)
and the sites in a site-list (sl_getview()->sl_slist) have the same values in all of these processes
when the message arrives. See VSYNC(TK) if this concept confuses you. Basically, the idea is
that messages in ISIS seem to arrive simultaneously at all destination processes.

The coordinator site will be the site where the message was sent if one of the processes in the alist
resides at that site, and randomly chosen otherwise. The other processes are cohorts and are
ranked using a fairly random algorithm based on site-id numbers. The processes in the alist must
all be members of the group designated by gid.

At the coordinator, the action routine is invoked as action(msg,gid,how), where gid is the group id
from the coord-cohort request and how will be the constant CC_COORD, defined in <isis/cl.h>.
The coordinator routine should execute the request and compute a result, storing it in an area of
memory allocated with malloc. It should then return a (char *) pointer to this area. This result
will be sent to the caller using a reply() mechanism but will also be transmitted to the cohorts,
where the got_reply routine is invoked as: got_reply(msg, result, rlen). The type field is used in
generating the reply message, but is not passed as an argument to the got_reply routine. The msg
argument is the one to the original coord_cohort call. The memory that the result occupied is
automatically freed when no longer needed.

In the event of a coordinator failure, one of the cohorts will take over and restart the action. The
restart invocation is identical to the initial action invocation except that how will now be equal to
CC_COHORT. No clean-up actions will have been taken; the cohort is responsible for this if any
are needed. If all of the processes in the alist fail, the caller receives a failure indication from the
original BCAST() that triggered the execution of the algorithm - specifically, the BCAST()
returns 0 (no replies) instead of 1 (the single reply the caller wanted).

The routine coordinator(gid, sender, alist) picks a coordinator and returns its address. It returns
NULLADDRESS if every process in the alist is down. The coordinator will be an operational
member of alist in the current view of gid, subject to the following rules:

1. If some process in the alist is at the same site as the sender, the coordinator will be picked
relative to this process.

2. Otherwise, the coordinator is picked relative to process alist[k] where k = sender.site mod
length(alist).

3. Now, given a starting point, entries in alist are evaluated one by one, and the first one that
is listed in the current view for gid is returned. NULLADDRESS is returned if all processes
in alist are tested and none is operational.
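
The three rules above can be sketched as follows. This is an illustration only, not the ISIS implementation: addr_t is a hypothetical stand-in for the ISIS address type, the current view is passed in as a plain array, the function returns an index into alist (or -1 in place of NULLADDRESS), and the scan in rule 3 is assumed to wrap around the end of the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ISIS address: a site number and a process id. */
typedef struct { int site; int pid; } addr_t;

/* Return 1 if a is listed among the n operational members of the view. */
static int in_view(const addr_t *view, size_t n, addr_t a)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (view[i].site == a.site && view[i].pid == a.pid)
            return 1;
    return 0;
}

/*
 * Pick a coordinator from alist (len entries).  Returns the index of the
 * chosen entry, or -1 if no entry of alist is in the current view.
 */
int pick_coordinator(const addr_t *alist, size_t len, addr_t sender,
                     const addr_t *view, size_t nview)
{
    size_t start, i;

    assert(sender.site >= 0);
    if (len == 0)
        return -1;

    /* Rule 2: default starting point is sender.site mod length(alist). */
    start = (size_t)sender.site % len;

    /* Rule 1: prefer a starting point at the sender's own site. */
    for (i = 0; i < len; i++) {
        if (alist[i].site == sender.site) {
            start = i;
            break;
        }
    }

    /* Rule 3: from the starting point, take the first operational entry
     * (wrapping around is an assumption of this sketch). */
    for (i = 0; i < len; i++) {
        size_t k = (start + i) % len;
        if (in_view(view, nview, alist[k]))
            return (int)k;
    }
    return -1;
}
```

Because every process evaluates the same alist against the same view, each one arrives at the same choice without exchanging any messages.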
1. Synopsis

Declaring the routine that will service requests to a given entry point.


#include "cl.h"

isis_entry(code, routine, rname)
int code;
int (*routine)();
char *rname;


When starting up, a program should bind routines to the entry codes that it will accept in
messages it receives. The toolkit routines do this for the generic addresses when isis_init() is invoked.
Once defined, it is illegal to redefine the generic entry points, although user entry points can be
rebound as desired. This is to prevent users from unintentionally screwing up the toolkit routines.
Entry points are bound by calling isis_entry and specifying the numeric code, the routine address,
and a printable name corresponding to this routine. The generic entry codes are defined in
generic_address.h; these are standard for all ISIS clients. Other entry codes can be defined on a
per-client basis starting with the number USER_BASE. Codes need not be allocated sequentially
and different applications can use the same entry code for totally different purposes.
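
The binding rules described above (generic entries fixed once defined, user entries rebindable) can be pictured with a small dispatch table. Everything here is assumed for the sake of the sketch: bind_entry and dispatch_entry are hypothetical names, and the values chosen for USER_BASE and the table size are placeholders rather than the constants from the ISIS headers.

```c
#include <stddef.h>

#define USER_BASE 64     /* placeholder value; the real constant comes from ISIS */
#define MAX_ENTRY 256    /* assumed table size for this sketch */

typedef int (*entry_fn)(void *msg);

static struct {
    entry_fn    routine;
    const char *rname;   /* printable name for the routine */
} entry_table[MAX_ENTRY];

/*
 * Bind a routine to an entry code.  Returns 0 on success, or -1 if the code
 * is out of range or names a generic entry (below USER_BASE) already bound.
 */
int bind_entry(int code, entry_fn routine, const char *rname)
{
    if (code < 0 || code >= MAX_ENTRY)
        return -1;
    if (code < USER_BASE && entry_table[code].routine != NULL)
        return -1;                       /* generic entries cannot be redefined */
    entry_table[code].routine = routine; /* user entries may be rebound freely */
    entry_table[code].rname = rname;
    return 0;
}

/* Call the routine bound to code; returns -1 if nothing is bound. */
int dispatch_entry(int code, void *msg)
{
    if (code < 0 || code >= MAX_ENTRY || entry_table[code].routine == NULL)
        return -1;
    return entry_table[code].routine(msg);
}

/* Two sample handlers used to exercise the table. */
int handle_ping(void *msg)  { (void)msg; return 1; }
int handle_query(void *msg) { (void)msg; return 2; }
```

A sparse table indexed by code matches the remark that codes need not be allocated sequentially.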

4. Intercepting a message

It is possible to intercept and examine messages before they reach the entry handling routine. See
FILTER(TK) for details.
1. Synopsis

Description of files used when starting ISIS up at a site and the program used to start the system
up.


2. File summary

sites: Lists the sites that are running this time
restart: Tells what programs to restart automatically

3. Starting isis up at a site

To run isis at a set of sites, first create a "sites" file in the following format: a '+' or a '-' (lines
with a minus are ignored), site-number (must start with 1), a colon, three numbers indicating the
internet ports for the isis sites to talk to each other, to use when restarting, and to talk to clients,
the site name, and if multiple isis systems are run on one site, a '/' followed by a number. The
port numbers can all be 0 if the /etc/services file is set up to support isis for your system. The
third of these numbers is the one isis_init() expects to be passed. For example, a typical sites file
might contain:

+ 1:1250,1251,1252 bullwinkle.cs.cornell.edu
+ 2:1250,1251,1252 kama.cs.cornell.edu

This file says that isis will be run with two sites operational, bullwinkle and kama. The port
numbers used are the same in this case, because bullwinkle and kama are two different machines.
To run two instances of isis on bullwinkle use a sites file like this one:

+ 1:1250,1251,1252 bullwinkle.cs.cornell.edu/1
+ 2:1253,1254,1255 bullwinkle.cs.cornell.edu/2

Here, the port numbers had to be different because all the ports are to be used at one site.
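
To make the sites file format concrete, here is a small parser for a single line. It is a sketch under assumptions: the site_entry structure, its field names, and parse_site_line are invented for this example, and the real isis startup program may read the file quite differently.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one line of the sites file. */
typedef struct {
    int  enabled;        /* 1 for a '+' line, 0 for a '-' (ignored) line */
    int  site_no;        /* site number, starting at 1 */
    int  port_isis;      /* port for isis sites to talk to each other */
    int  port_restart;   /* port used when restarting */
    int  port_client;    /* port for clients; the one passed to isis_init() */
    char host[128];      /* site name without any "/n" suffix */
    int  sub_no;         /* number after the '/', or 0 if there is none */
} site_entry;

/* Parse one line of the sites file.  Returns 1 on success, 0 if malformed. */
int parse_site_line(const char *line, site_entry *e)
{
    char flag;
    char name[160];
    char *slash;

    if (sscanf(line, " %c %d:%d,%d,%d %159s", &flag, &e->site_no,
               &e->port_isis, &e->port_restart, &e->port_client, name) != 6)
        return 0;
    if (flag != '+' && flag != '-')
        return 0;
    e->enabled = (flag == '+');

    /* Split off the optional "/n" suffix used when several isis systems
     * run on one machine. */
    e->sub_no = 0;
    slash = strchr(name, '/');
    if (slash != NULL) {
        *slash = '\0';
        if (sscanf(slash + 1, "%d", &e->sub_no) != 1)
            return 0;
    }
    if (strlen(name) >= sizeof e->host)
        return 0;
    strcpy(e->host, name);
    return 1;
}
```

Keeping the three ports in separate named fields makes it harder to pass the wrong one to a client.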

The restart file tells what programs to restart when isis comes up at a site. These are mostly
system services, like the remote exec service. Here is a typical restart file:

/fs/moose/b/isis/protos/protos <isis-protos> -p
/fs/moose/b/isis/client/rexec <isis-rexec>
/fs/moose/b/isis/client/rmgr <isis-rmgr>

This tells the system to run the protocols process (protos), the rexec program (rexec), and the
recovery manager. The argument “-p” to the protos process is required when isis is run this way.
Extra arguments may be supplied by the isis startup program: in the case of the protos program,
-Sname if the sites file name is not the standard one, -H# if the site has a sub-number, and in the
case of the other programs, a port number if the sites file specifies something for the client-isis
connection port. For example, using the second sites file and the above restart file, the protocols
program for bullwinkle/2 will be invoked as

<isis-protos> -H2

and the rmgr for bullwinkle/2 will be invoked as

<isis-rmgr> 1255

The latter command is telling the rmgr program to call isis_init(1255) to connect to isis-protos.

To run isis at a site, type

isis [-Sname] [-Rname] + &

If the sites file name is not given, “sites” is assumed. If the restart file name is not given,
“restartfile” is assumed. The “+” argument tells isis to bypass the site-failure detector, which is not yet
fully debugged; in this case, all sites must be brought up more or less at the same time (so, if
using bullwinkle and kama, you would have to run isis on both at the same time, by
hand!). This command will bring up as many instances as are needed for the local host, so using
the second sites file it will automatically start isis up twice. A SUN/2 begins to perform poorly
when it must support more than 2 isis sites at a time, the gould somewhat more (4 to 6
maximum).

A typical program, say 20-questions, would be run as

twenty client-port

giving the client port-number to use to connect to isis, or just

twenty

if the /etc/services file is set up correctly to know about isis. Note: It takes about 30 seconds to
restart isis at a site. If a program tries to connect to isis before restart is finished, various errors can
result.

To  kill  isis  at  a  site,  do  a 

and kill the protos process(es) that are listed. If the system crashes, the reason it crashed is
usually given in the file isis.log (1.log, 2.log, etc if several run on one system). After killing isis, a
UNIX bug can cause the ports it was using to linger for 30 seconds to a minute. If you wait long
enough before restarting the system, these will normally go away. If they don't there is nothing
to be done except to change the ports isis is using or to reboot the machine. (This is obviously a
UNIX bug.)

4. Restart sequence

The figure below outlines the stages of restarting isis at a site. As seen in the figure, “isis” starts
by reading the sites file and then trying to contact isis processes elsewhere in the network. If it
finds any, the system restarts by joining them. Otherwise, it assumes the restart is from a total
failure. Next, all the standard system programs are run, and then the recovery manager restarts
user programs (see RMGR(TK)).
[Figure: The ISIS Recovery and Monitoring Sequence. Labels recoverable from the figure: "Site 3, incarnation 7"; "Broadcast: Is ISIS out there?"; "total restart"; "Run recovery protocol"; "Scan recovery database"; "partial failure action"; "Monitor ISIS at this site"; "ISIS failed"; "Signal clients: ISIS has crashed".]
1. Synopsis

A mechanism for intercepting arriving messages, used internally by the system.


2. Interface

#include <isis/cl.h>

typedef int ifunc();

ifunc *isis_setfilter(), *old_filter;

old_filter = isis_setfilter(routine);

(void)isis_setfilter(old_filter);



ISIS has several tools that “filter” the stream of messages arriving at a process. For example, the
state transfer tool spools messages of several types during transfers, the authentication tool
validates the legality of arriving requests, etc. A filter works by interposing itself between the
arriving message queue and the next “lowest” level filter, down to cl_local_delivery, which is the
lowest level of all. A filter is called as a message arrives, and can inspect it:

routine(mp)
message *mp;
{
}

Actions available to a filter are to reject the message (usually by sending some sort of reply,
perhaps a reply(mp,0,0,0)), or to pass it to the next level by calling old_filter(mp). A filter may
fork off a new task or issue a signal, but must not try to wait or do an RPC or broadcast. This is
because a filter does not run as a normal task.


The state-transfer utility spools some messages by setting itself up as a filter. Its filter routine
looks like this:

xfer_filter(mp)
register message *mp;
{
        if(<let mp through>)
                old_filter(mp);
        else
        {
                if(xfer_queue == (queue*)0)
                        xfer_queue = qu_null();
                qu_add_mp(xfer_queue, 0, mp, nullroutine);
        }
}

When a transfer finishes, messages are despooled by restoring the old filter and then replaying the
messages into it:

xfer_despool()
{
        register queue *qh, *qp;
        ifunc *filter;

        (void)isis_setfilter(old_filter);
        filter = old_filter;
        qh = xfer_queue;
        xfer_queue = (queue*)0;
        while(qp = qu_head(qh))
        {
                filter(qp->qu_mp);
                qu_free(qp);
        }
        qu_free(qh);
}

Notice how the replay mechanism “steps to the side” by using a copy of the queue pointer and
resetting the old value to null, and also by resetting the isis filter before despooling any messages.
This is necessary to ensure that a reinvocation of the state transfer tool (or some other entry point
that changes the filter) can set up the filter again and create a new transfer spool. If this idea (a
reentrant procedure) is unfamiliar to you, it is probably not a good idea to try and use message
filters in your application: the sorts of bugs that you may run into are pretty bizarre!
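
The spool-and-replay pattern can be tried out in isolation with the toy model below. It is a sketch only: the message type, set_filter, arrive, xfer_begin and xfer_end are stand-ins invented for this example, and a fixed array replaces the ISIS queue routines. What it does preserve is the "step to the side" order: restore the old filter and take a private copy of the spool before replaying anything.

```c
#include <stddef.h>

typedef struct { int id; } message;           /* stand-in for an ISIS message */
typedef void (*filter_fn)(message *mp);

#define SPOOL_MAX 16

int delivered[SPOOL_MAX];                     /* ids that reached the bottom level */
int ndelivered;

static message *spool[SPOOL_MAX];             /* messages held during a transfer */
static int nspool;
static int transfer_active;

/* Lowest level of all: record the delivery (extra messages are dropped). */
static void local_delivery(message *mp)
{
    if (ndelivered < SPOOL_MAX)
        delivered[ndelivered++] = mp->id;
}

static filter_fn current_filter = local_delivery;
static filter_fn old_filter;

/* Install a new top-level filter and return the previous one. */
filter_fn set_filter(filter_fn f)
{
    filter_fn prev = current_filter;
    current_filter = f;
    return prev;
}

/* Called for every arriving message. */
void arrive(message *mp) { current_filter(mp); }

/* Spool messages while a transfer is active, otherwise pass them down. */
static void xfer_filter(message *mp)
{
    if (!transfer_active)
        old_filter(mp);
    else if (nspool < SPOOL_MAX)
        spool[nspool++] = mp;
}

void xfer_begin(void)
{
    transfer_active = 1;
    old_filter = set_filter(xfer_filter);
}

/* Restore the old filter and empty the shared spool BEFORE replaying, so a
 * nested xfer_begin() triggered during replay starts from a clean state. */
void xfer_end(void)
{
    message *pending[SPOOL_MAX];
    filter_fn next = old_filter;
    int i, n = nspool;

    (void)set_filter(old_filter);
    transfer_active = 0;
    for (i = 0; i < n; i++)
        pending[i] = spool[i];
    nspool = 0;
    for (i = 0; i < n; i++)
        next(pending[i]);
}
```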
1. Synopsis

Isis initialization routine.

2. Interface

#include "cl.h"

isis_init(port-number);


This routine connects the caller program to ISIS and initializes both the data structures used by the
task and toolkit facilities and various constants, such as my_process_id, my_address, my_site_no,
my_site_incarn, and the site_names[] table (site_names[i] is a printable name for site i). The
port-number may be given as 0, in which case the value in the system-file /etc/services is used, or
as a non-zero number in which case the specified port is assumed to be the one to connect to ISIS
on. See ADDRESSING(TK) for a discussion of the sites file which contains the port number
specification. STARTUP(TK) gives some examples of programs that call isis_init, and FILES(TK)
talks about where these port numbers come from.

System services like REXEC, RMGR, etc. should set the variable my_process_id to their “name”
before calling isis_init.

The task routines may crash if called before isis_init() is called.
1. Synopsis

How to manage replicated data structures that can be recovered after total failures (all process
group members die).

One problem confronted by the ISIS programmer relates to the recovery of a replicated data
structure after a failure. In a situation where someone survived the failure, this is easy: the state
transfer tool can be used, with the recovery manager giving advice on when to do the transfer.
But, what if everyone crashes? The recovery manager will wait until all the relevant sites come
back up, then restart the failed processes, but exactly how should the missing data structures be
rebuilt?

In many systems, the solution has been to resort to transactional files on disk. You can do this
in ISIS too (see TRANSACTIONS(TK)). Transactionally updated files survive failures and are
updated atomically, hence anything damaged by a crash will automatically be restored. In ISIS,
however, the entire idea is to move away from nested transactions and towards less costly, less
intrusive mechanisms.

A simple way to deal with recovery, and the one we recommend, is as follows: Replicated data is
maintained in core, in a volatile form, using tools like the replicated update tool to query and
update it. When an update changes the structure in a significant way, however - a way that needs
to be preserved after failure - the update routine should “log” the change in a well-known file in
a well-known place. For example, you could create a file called “/usr/spool/logs/my_prog.log”.
Use a simple file format (ascii is nice) and keep appending to it and doing a flush (fsync(2)) after
each write. On recovery, you will need to reread the file and rebuild the data structure one
change at a time.
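
A minimal sketch of this append-and-replay scheme is shown below, using only standard UNIX calls. The function names log_append, log_replay and sum_record, and the one-record-per-line ascii format, are assumptions made for the example; ISIS does not prescribe them.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Append one ascii record to the log and force it to disk with fsync(2).
 * Returns 0 on success, -1 on any error. */
int log_append(const char *path, const char *record)
{
    size_t len = strlen(record);
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    int ok;

    if (fd < 0)
        return -1;
    ok = write(fd, record, len) == (ssize_t)len
      && write(fd, "\n", 1) == 1
      && fsync(fd) == 0;
    if (close(fd) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Reread the log and hand each record, in order, to apply().  Returns the
 * number of records replayed, or -1 if the log cannot be opened.  Records
 * longer than the line buffer would be split; a real tool should check. */
int log_replay(const char *path,
               void (*apply)(const char *record, void *state), void *state)
{
    char line[512];
    int n = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
        apply(line, state);
        n++;
    }
    fclose(fp);
    return n;
}

/* Example apply routine: the "data structure" is a running total. */
void sum_record(const char *record, void *state)
{
    *(long *)state += atol(record);
}
```

Opening, syncing and closing on every record is slow but keeps the example short; a long-running process would normally keep the descriptor open.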

One thing to be aware of is that when several processes die at once, they could leave their logs in
slightly different states. Since the recovery manager arranges for all the last processes to fail to
recover at the same time, your recovery mechanism should avoid a situation where these processes
recover independently from logs that could have different lengths or contain non-identical
information. Some ways to do this are: have a coordinator recover from one of the logs and then use a
state transfer to copy this information to other processes, or maintain the logs in a canonical
order and reach agreement on what the last record is, or maintain identical logs and reach
agreement on what the length is. To reach agreement, use a group RPC that queries the other
processes with copies of the log and take the minimum length that is returned in the answer.

Frankly, unless your application uses ABCAST whenever an update to the in-core storage occurs,
which makes it trivial to keep the logs in a canonical order, we recommend that you go with the
coordinator-cohort solution. It just isn't worth going to so much trouble to deal with a situation
that almost never happens anyway. Even this solution won't be trivial, because you have to cover
two cases: one where the coordinator succeeds in loading its log and other processes just do state
transfers to join, and one where the coordinator dies while loading the log or during a state
transfer, forcing a cohort to restart from its own version of the log. A tool that does this will be
provided as part of the recovery manager later this summer, and an example that uses it will be
included in the next edition of this documentation. The code isn't nearly as complex as it
probably sounds.

If your logs are likely to get very long, a periodic checkpoint might be a good idea. You can
create one by supporting a “checkpoint” log record, which would always appear at the beginning
of a log. To make a new checkpoint, first write it into a temporary file, then rename the file in
one shot so as to atomically switch from the old log to the new one. (See rename(2) in the 4.2bsd
UNIX manual).
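
The write-then-rename step might look like the sketch below. The function name log_checkpoint, the ".tmp" suffix and the "checkpoint <state>" record layout are all assumptions for illustration; only rename(2) and fsync(2) come from the text above.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace the log at path with a fresh log whose first (and only) record is
 * "checkpoint <state>".  The new log is built in a temporary file and then
 * renamed over the old one, so a reader sees either the complete old log or
 * the complete new one.  Returns 0 on success, -1 on error. */
int log_checkpoint(const char *path, const char *state)
{
    static const char tag[] = "checkpoint ";
    char tmp[512];
    size_t slen = strlen(state);
    int n, fd, ok;

    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    ok = write(fd, tag, sizeof tag - 1) == (ssize_t)(sizeof tag - 1)
      && write(fd, state, slen) == (ssize_t)slen
      && write(fd, "\n", 1) == 1
      && fsync(fd) == 0;              /* data must be on disk before the switch */
    if (close(fd) != 0)
        ok = 0;

    if (!ok || rename(tmp, path) != 0) {
        remove(tmp);                  /* leave the old log untouched on failure */
        return -1;
    }
    return 0;
}
```

Later updates are then appended after the checkpoint record, and recovery starts from the checkpointed state instead of replaying the whole history.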
1. Synopsis

An overview of messages as they are used in ISIS.

2. Include file

#include <isis/cl.h>

In ISIS, communication is via messages. Basically, a message is a container for some amount of
data, organized as a vector of “fields”, and possessing certain standard attributes. Specifically,
each message has a sender, which is the address of the process that sent it (see
ADDRESSING(TK)), a list of destinations, which is a null-terminated address list, and a set of
fields containing data. The message editing system (see MSG_EDIT(TK)) provides a variety of
routines to create and manipulate messages. Here, we confine ourselves to a summary that
focuses on concepts rather than detail.

The type of a message is “message”, and is predefined in cl.h. The data structure used is quite
complex. An empty message is created by the routine msg_newmsg(), which returns a pointer to
the message but doesn't fill out any of its fields. The sender field of a message is automatically
set by the system in most cases, so applications can assume that this information is “secure”. The
routine msg_getsender(msg) returns a pointer to this field or ((address*) 0) if it is undefined.
Similarly, the routine msg_setdests(msg,alist) sets the destination list of a message and
msg_getdests(msg) returns a pointer to it. These can often be left undefined, for example if the
message is to be broadcast: in such situations, the alist is computed by the ISIS communication
subsystem and filled in automatically. (If the sender field is not filled in, the message subsystem
uses the address of the process that called msg_newmsg() by default). The routine
msg_getdests(mp,len), however, is quite useful: it returns a list of the destinations to which copies
of a message were to be delivered. For example, this would be a good value to provide to
coord_cohort() as an alist. If you use the destination list for any other purpose, be aware that your
process group view should be checked to confirm that these members are still operational, see
pg_getview() in PGROUP(TK) for details.

A message has no upper size limit, but ISIS tends to get sick when messages exceed a few
hundred-thousand bytes in length. Some parts of the system break messages into 4k chunks for
transmission, so 4k is a good upper limit. Since messages have overhead, the user-space available
in a 4k message is only about 3.9k bytes.

Message fields are used to store data in a message. A field has a name, represented by an integer
in the range 1-127, a type, a value, and a length. There are currently three types of fields,
although more may be supported in the future:

1. Character fields are sequences of bytes having the designated length. These are not
interpreted by the message subsystem. Users who employ a stub generator such as the UNIX
XDR mechanism should think of the XDR output as a character field even if it represents
multiple arguments or data items. The type of a character field is FTYPE_CHARS.

2. Long integers have different byte orders on different machines. Within ISIS, many data
structures consist of vectors of long integers. A field containing long-integers will
automatically be byte swapped on arrival at a remote machine if necessary, but the message editing
system must be told that the field is not just a field full of characters. To do this, the type
field should be specified as FTYPE_LONGINTS. The length of the field should be given in
bytes. The message editing subsystem knows about byte orders automatically and will swap
bytes as needed.

3. Other messages. The idea is that a message can be stuffed into another message, which is
convenient when multiple messages need to be piggybacked to a single destination, or when
extra fields need to be added to a message without risk of banging into fields already in the
message. The type of a message field is automatically set to FTYPE_MSG.

4. Note: ISIS currently uses an address format that is machine independent. Lest this change,
a field type FTYPE_ADDRS is supported.

5. The following additional field types are supported: FTYPE_SHORTINTS, FTYPE_SIDS
(site-id's), FTYPE_PGVIEW (process group view).
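
The byte swapping mentioned for long-integer fields amounts to reversing the bytes of each element. The sketch below shows the idea; it assumes 32-bit long integers, and the names swap32, swap_longint_field and host_is_little_endian are invented for the example rather than taken from the message editing library.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit value. */
uint32_t swap32(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u)
         | (x << 24);
}

/* Swap every element of a long-integer field in place.  The length is given
 * in bytes, as it is for the field itself.  Returns the number of elements
 * swapped, or -1 if the length is not a multiple of the element size. */
int swap_longint_field(uint32_t *field, size_t len_bytes)
{
    size_t i, n = len_bytes / sizeof(uint32_t);

    if (len_bytes % sizeof(uint32_t) != 0)
        return -1;
    for (i = 0; i < n; i++)
        field[i] = swap32(field[i]);
    return (int)n;
}

/* Returns 1 if this machine stores the least significant byte first. */
int host_is_little_endian(void)
{
    uint16_t probe = 1;
    return *(unsigned char *)&probe == 1;
}
```

A receiver would swap only when the sender's byte order differs from its own, which is why the field must be typed rather than sent as plain characters.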

The routines used to manipulate message fields are as follows:

msg_addfield(mp,fname,ptr,ftype,len)

Adds a field named fname to the message having a value copied from the place the pointer
points to and length given by len. The field name need not be unique. The type is as
described above.

msg_getfield(mp,fname,inst,lenptr)

The message is searched for the inst'th instance of the designated field name (the first
instance is number 1). If found, a pointer to the value is returned and if lenptr is non-zero,
the integer variable it points to is set to the length in bytes (or elements) of the character
field (or long integer field).

msg_deletefield(mp,fname,inst)

Deletes the inst'th instance of the named field.

msg_addmsg(mp,fname,ptr)

If ptr is a pointer to a message, creates a field with the given name containing this message.

msg_getmsg(mp,fname,inst)

Returns a copy of the message in the inst'th field having the given fname.

There are additional routines that can be used to retrieve multiple values or messages at once
when a message is expected to have several field instances with a single field name. Also useful is
a procedure called msg_create() that creates a message and initializes a number of fields at the
same time; it takes a list of field names, values, and lengths terminated by a field name of 0. See
the MSG_EDIT(TK) documentation for details.
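
The way repeated field names and instance numbers interact is easiest to see in a toy model. The sketch below is not the ISIS message structure: toy_msg, the toy_* routines and the fixed size limits are assumptions made for the example, and field types are left out entirely.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 32    /* assumed limits for this sketch */
#define MAX_DATA   64

typedef struct { int fname; int len; char data[MAX_DATA]; } toy_field;
typedef struct { int nfields; toy_field f[MAX_FIELDS]; } toy_msg;

/* Append a field; the value is copied into the message.  Field names need
 * not be unique.  Returns a pointer to the stored copy, or NULL on error. */
char *toy_addfield(toy_msg *m, int fname, const void *ptr, int len)
{
    toy_field *fp;

    if (fname < 1 || fname > 127 || len < 0 || len > MAX_DATA
        || m->nfields >= MAX_FIELDS)
        return NULL;
    fp = &m->f[m->nfields++];
    fp->fname = fname;
    fp->len = len;
    memcpy(fp->data, ptr, (size_t)len);
    return fp->data;
}

/* Index of the inst'th instance of fname (first instance is 1), or -1. */
static int toy_find(const toy_msg *m, int fname, int inst)
{
    int i;
    for (i = 0; i < m->nfields; i++)
        if (m->f[i].fname == fname && --inst == 0)
            return i;
    return -1;
}

/* Return a pointer to the value of the inst'th instance, or NULL.  If
 * lenptr is non-null, the length in bytes is stored through it. */
char *toy_getfield(toy_msg *m, int fname, int inst, int *lenptr)
{
    int i = toy_find(m, fname, inst);
    if (i < 0)
        return NULL;
    if (lenptr != NULL)
        *lenptr = m->f[i].len;
    return m->f[i].data;
}

/* Delete the inst'th instance; later instances move up by one. */
int toy_deletefield(toy_msg *m, int fname, int inst)
{
    int i = toy_find(m, fname, inst);
    if (i < 0)
        return -1;
    memmove(&m->f[i], &m->f[i + 1],
            (size_t)(m->nfields - i - 1) * sizeof(toy_field));
    m->nfields--;
    return 0;
}
```

Note that deleting instance 1 renumbers the remaining instances, so what was instance 2 becomes instance 1.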

4. Field names

Some field names are standard in ISIS. These have numbers 128-253 and are defined in msg.h; they
can be fetched using msg_getfield but not set. Field numbers 1-127 are available for general use, and
different applications will tend to use the same numbers for different purposes. Number 0 is not
used. It is generally a good idea not to reuse field numbers within any single application; this
avoids confusion.


5. Sending a message

A number of routines facilitate the transmission of messages. The most commonly used routines
are the BCAST() routines; see BCAST(TK). In ISIS clients, the procedure isis_send(dest,msg)
can be used to transmit a message to a destination given by the address dest. This routine is used
heavily within the system. A second routine, isis_rpc(dest,msg), sends the message, waits for a
reply, and then returns the reply message to the caller. The answer itself will be in the field
FLD_ANSW. If the answer was given as a null pointer, this field will not be defined. If no reply
will arrive because of a failure, isis_rpc returns a null pointer. Broadcasts are sent using any of a
variety of procedures documented in BCAST(TK).

Each message has a reference count which can be incremented by the procedure
msg_increfcount(). The reference count is initially 0 and need only be incremented when the
message is placed on a queue or otherwise passed to some task other than the original creator. Later,
the message reference count is decremented by calling msg_delete(). If the count was 0,
msg_delete() will free the message for reuse. While a message has a non-zero reference count, it
is illegal to add or delete fields: such a message is said to be immutable, and optimizations taking
advantage of this property have been included into the msg_addmsg() and msg_getmsg() utilities.
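
The reference counting convention described above (a count that starts at 0, with the message freed by the delete call that finds it at 0) can be modelled as follows. The rc_* names and the live_messages counter are assumptions for this sketch and are not part of the ISIS message library.

```c
#include <stdlib.h>

/* Toy message: only the bookkeeping needed to show the convention. */
typedef struct { int refcount; int nfields; } rc_msg;

int live_messages;       /* messages allocated and not yet freed */

/* Create a message; its reference count starts at 0. */
rc_msg *rc_new(void)
{
    rc_msg *m = calloc(1, sizeof *m);
    if (m != NULL)
        live_messages++;
    return m;
}

/* Called when the message is queued or handed to another task. */
void rc_incref(rc_msg *m) { m->refcount++; }

/* Adding a field is only legal while the count is 0 (message not shared).
 * Returns 0 on success, -1 if the message is immutable. */
int rc_addfield(rc_msg *m)
{
    if (m->refcount != 0)
        return -1;
    m->nfields++;
    return 0;
}

/* If the count is 0, free the message; otherwise decrement the count.
 * Returns 1 if the message was freed, 0 if it is still referenced. */
int rc_delete(rc_msg *m)
{
    if (m->refcount == 0) {
        free(m);
        live_messages--;
        return 1;
    }
    m->refcount--;
    return 0;
}
```

With this convention the creator and each additional holder call the delete routine exactly once, and the last caller frees the storage.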


7.  Byte  swapping  and  arid  news 

The  address  format  used  in  ISIS  has  been  designed  so  that  even  in  future  versions  of  the  system, 
there  will  be  no  need  to  byte-swap  addresses  when  they  are  passed  from  process  to  process  across 
machine  boundaries. 


Some systems allow a process to forward a message transparently: to the end recipient it looks
as if this message came directly from the original sender. ISIS doesn't allow you to do this, although
it would be easy enough to support if desired. The problem is that such a facility appears
to break our security model. To forward a message using the current system, you would
pack it into some other message using msg_addmsg() and then convince the destination to unpack
the message and deliver it locally, by calling cl_local_delivery() on it.
/* msg_addfield: Add a new field to a given message. Only non-negative */
/*               field names are accepted. Returns a pointer to the    */
/*               field.                                                */

char *
msg_addfield (msg, field, data, type, len)
message *msg;
int field, type, len;
char *data;


/* msg_addmsg: Insert msg2 into msg1 as a field.                 */
/*             Only non-negative field names are accepted.       */
/*             Returns pointer to field.                         */

char *
msg_addmsg (msg1, msg2, field)
message *msg1, *msg2;
int field;


/* msg_copy: Make a copy of the given message */

message *
msg_copy (msg)
message *msg;


/* msg_delete: If the reference count of the given message is zero, */
/*             release the space allocated to it. Else, decrement   */
/*             the reference count. Called by the routine that      */
/*             allocated the message and by any routine that        */
/*             called msg_increfcount on the message.               */

void
msg_delete (msg)
message *msg;


/* msg_deletefield: Delete the given instance of a field */

void
msg_deletefield (msg, field, inst)
message *msg;
int field, inst;


/* msg_genmsg: Generate a message containing the fields passed as      */
/*             arguments, which are of the form field_name,            */
/*             pointer_to_data, field_type, field_length, ...,         */
/*             followed by 0                                           */

message *
msg_genmsg (field1, data1, type1, len1, field2, data2, type2, len2, ..., 0)
int fields, types, lens;
char *datas;


/* msg_getdests: Return a pointer to a null-terminated list of the       */
/*               destinations of a given message (null pointer, if none). */
/*               If a non-zero argument is given for n_dests, the number  */
/*               of destinations is returned.                             */

address *
msg_getdests (msg, n_dests)
message *msg;
int *n_dests;


/* msg_getfield: Return a pointer to the data in a given instance of a  */
/*               field and the length of the field if the last argument */
/*               is non-zero.                                           */

char *
msg_getfield (msg, field, inst, len)
message *msg;
int field, inst;
int *len;


/* msg_getfields: Return an array of pointers to the data of up to the first */
/*                n instances of a given field, and their lengths (if the    */
/*                argument is non-zero). Return value: number of fields      */
/*                actually found.                                            */

msg_getfields (msg, field, pointers, lengths, n)
message *msg;
int field, lengths[], n;
char *pointers[];


/* msg_getiovec:  Return a pointer to an array of iovec structures for */
/*                a given message. See write(2) and read(2).           */
/* msg_getiovlen: Return the length of the iovec array.                */
/*                (These should really be macros).                     */

struct iovec *
msg_getiovec (msg)
message *msg;

msg_getiovlen (msg)
message *msg;


/* msg_getlen: Return the length of the transmittable part of a given message */

msg_getlen (msg)
message *msg;


/* msg_getmsg: Create a message from the contents of a given instance */
/*             of a field. (It is assumed that the field has a proper */
/*             message header.)                                       */

message *
msg_getmsg (msg, field, inst)
message *msg;
int field, inst;


/* msg_getmsgs: Return a vector of messages created from up to the first */
/*              n instances of a given field, and return the actual      */
/*              number of such messages. (It is assumed that each        */
/*              field has a proper header.)                              */

msg_getmsgs (msg, field, messages, n)
message *msg, (*messages)[];
int field, n;


/* msg_getreplyto: Return the address to send the reply for a given message */
/*                 Caution: this may not be the sender!                     */

address *
msg_getreplyto (msg)
message *msg;


/* msg_getsender: Return the address of the sender for a given message   */
/*                Caution: this may not be the place to send replies!    */

address *
msg_getsender (msg)
message *msg;


/* msg_increfcount: Increment the reference count of a given message        */
/*                  Caution: A message with multiple refs cannot be changed */

void
msg_increfcount (msg)
message *msg;

/* msg_newmsg: Create a new message with sender field filled in */

msg_newmsg ()


/* msg_read: Read and return a message from a file descriptor.   */
/*           Second argument gives length if known, 0 otherwise. */

message *
msg_read (sd, len)
int sd;
int len;


/* msg_reconstruct: reconstruct the argument into a message and return a ptr */

message *
msg_reconstruct (ptr)
char *ptr;


/* msg_reconstruct_inplace: reconstruct the message in place and return a ptr */
/*                          to its header.                                    */
/* Caution: assumes the ptr points to a malloc region; will be freed          */
/*          automatically later by the message editing system                 */

message *
msg_reconstruct_inplace (ptr)
char *ptr;


/* msg_setdest: Set the destination field to the given destination */

void
msg_setdest (msg, dest)
message *msg;
address *dest;


/* msg_setdests: Set the destination field to the given null-terminated */
/*               list of destinations                                   */

void
msg_setdests (msg, dests)
message *msg;
address dests[];


/* msg_setreplyto: Set the reply-to field to the given address           */
/* Note: you cannot set the sender address - this is automatic           */

void
msg_setreplyto (msg, who)
message *msg;
address who;


/* msg_write: Write the given message on the given file descriptor */

msg_write (sd, msg)
int sd;
message *msg;
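
The zero-terminated argument list accepted by msg_genmsg() in the listing above can be walked with the standard <stdarg.h> macros. The following is only a sketch of that calling convention, not the toolkit's code: the function toy_total_len is hypothetical, and it assumes each group is passed as (int field, char *data, int type, int len) with a final field number of 0 (field number 0 is not otherwise used).

```c
#include <stdarg.h>

/* Hypothetical helper: walk a msg_genmsg()-style argument list of
 * (field, data, type, len) groups ended by a field number of 0 and
 * return the total number of data bytes described. */
int toy_total_len(int field, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, field);
    while (field != 0) {
        char *data = va_arg(ap, char *);  /* pointer to field data */
        int type = va_arg(ap, int);       /* field type code */
        int len = va_arg(ap, int);        /* field length in bytes */

        (void)data;                       /* unused in this sketch */
        (void)type;
        total += len;
        field = va_arg(ap, int);          /* next field number, or 0 */
    }
    va_end(ap);
    return total;
}
```

A call such as toy_total_len(1, "quit", 0, 5, 0) describes a single field of five bytes; forgetting the trailing 0 would make the loop read past the supplied arguments, which is the usual hazard of this style of interface.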
1. Synopsis

The News Service allows an ISIS client to post messages which will automatically be forwarded to
other processes that are subscribers of the news service.

2.  Interface 

#include <isis/cl.h>

news_post(slist, subject, mp, back)
news_posta(slist, subject, mp, back)
site_id slist[];
char *subject;
message *mp;
int back;

news_clear(slist, subject)
news_clear_all(slist, subject)
site_id slist[];
char *subject;


nback = news_subscribe(subject, entry, back);
int nback;
char *subject;
int entry, back;

news_cancel(subject);
char *subject;


To post a message, a process calls news_post. Slist is a list of sites to which the message will be
forwarded; if a null pointer is given instead, the message will be forwarded to all operational sites.
Subject is an arbitrary string of up to SUBJLEN characters. For every subject the news service at
each site keeps a list of recently posted messages which new subscribers may look at. When a
message is posted, the parameter back determines how long the message will be held as a "back
issue". If back = 0, the message will be forwarded only to current subscribers and will be deleted
immediately afterwards. If back is greater than zero, for example back = 5, the message will be
held until five new messages have been posted to the same subject.

News_post uses CBCAST to broadcast the message to the news services at other sites. If it is
important that all subscribers receive news messages in the same order, then news_posta should be
called, which will use an ABCAST to post the message.

Messages kept as back issues on a certain subject may be deleted explicitly by calling news_clear
(deletes all messages posted by the caller), or news_clear_all (deletes all messages posted by
anybody).

A process that wishes to subscribe to a news subject calls news_subscribe, specifying an entry point
(declared by isis_entry(), see ENTRIES(TK)) to which the news service will send messages posted
to that subject. The parameter back specifies how many back issues the subscriber wants to
receive. News_subscribe returns the actual number of back issues available that will be sent to the
subscriber.

When a message is posted, the news service automatically adds the two fields FLD_SUBJ and
FLD_BACK containing the 'subject' and 'back' parameters from the news_post() call. A subscriber
may inspect these fields simply by calling msg_getfield().

News_cancel cancels a subscription for a given subject.
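
The back-issue rule described above (a message posted with back = k is held until k newer messages have been posted to the same subject, and a message posted with back = 0 is never held) can be modeled as follows. This is a sketch under stated assumptions and not the news service's data structure: struct issue, struct subject, MAX_HELD and toy_post are all hypothetical names, and delivery to subscribers is left out entirely.

```c
#include <string.h>

#define MAX_HELD 16   /* arbitrary capacity for this sketch */

/* One held back issue; hypothetical layout, not the news service's. */
struct issue {
    char text[32];
    int remaining;    /* number of newer posts this issue still survives */
};

struct subject {
    struct issue held[MAX_HELD];
    int nheld;
};

/* Post a message: age the existing back issues, then hold the new one
 * if back > 0.  Returns the number of back issues now held. */
int toy_post(struct subject *s, const char *text, int back)
{
    int i, kept = 0;

    /* Each new post uses up one unit of every held issue's lifetime. */
    for (i = 0; i < s->nheld; i++) {
        if (--s->held[i].remaining > 0)
            s->held[kept++] = s->held[i];
    }
    s->nheld = kept;

    if (back > 0 && s->nheld < MAX_HELD) {
        struct issue *n = &s->held[s->nheld++];
        strncpy(n->text, text, sizeof(n->text) - 1);
        n->text[sizeof(n->text) - 1] = '\0';
        n->remaining = back;
    }
    return s->nheld;
}
```

A new subscriber asking for back issues would, in this model, simply be handed the contents of held[] for its subject.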

4.  Diagnostics 

All routines return a nonzero value in the case of an error, except for news_subscribe, which
indicates an error by returning a negative value. Note that news_post and news_posta do not wait for
replies when broadcasting a message. Therefore a successful return does not yet guarantee that
the message has been delivered to remote sites.

5. Bugs

A site crash wipes out all back issues held by the local news service. The news service does not
save messages on stable storage, nor does it attempt to get back issues from some other site after
a recovery.

This version of the news service does not provide any form of security. Any ISIS client can post
and receive messages on any subject; it can delete any back issues it wants to (news_clear_all).

News uses a linear search to find subjects in its tables. Hashing should be used instead.
1. Synopsis

A  package  of  routines  implementing  process  groups  and  group  addressing. 

2.  Interface 

#include <isis/cl.h>

isis_init(0);

pg_init();


address pg_create(name, incarn)
char *name;
int incarn;

address pg_lookup(name)
char *name;

pg_addmemb(gid, who)
address gid, who;

pg_leave(gid)
address gid;

pg_migrate(gid, newpname)
address gid, newpname;

pg_delete(gid)
address gid;

pg_addclient(gid, client)
address gid, client;

pg_delclient(gid, client)
address gid, client;

address *pg_getview(gid)
address gid;

pg_lockview(gid)
address gid;

pg_signal(gid, signo)
address gid;
int signo;

pg_monitor(gid, routine, arg)
address gid;
int (*routine)();
char *arg;

pg_monitor_cancel(gid, routine, arg)
address gid;
int (*routine)();
char *arg;

pg_join(gid, mp)
address gid;
message *mp;

pg_join_verifier(routine)
int (*routine)();

pg_dump()


3. Discussion

This package maintains process group membership information. There are two ways a process can
relate to a group:

1. As a client that sends requests to the group. Due to a restriction on addressing modes, we
distinguish between the case of a client known to the group, with unrestricted permission to
use the group address in its address lists (see BCAST(TK)), and the case of a client that the
group doesn't know about in this sense, who can only broadcast to the group in a restricted
manner (see BCAST(TK) again). Groups can promote a client to the more powerful form
of addressing using pg_addclient(), but this must be done by a current member. The routine
is idempotent, so calling it a few times won't crash the system or anything, but it might be
slow. A client can monitor group membership changes, but will not receive broadcasts sent
to the group and cannot initiate membership changes or add other clients on its own.

2. As a member. Group members receive messages sent to the group and have unlimited
freedom to call the routines defined above. Group members implicitly have unlimited
addressing freedom with respect to their group.

We now describe the various routines available to group members and their clients. See also the
discussion in PROTECTION(TK), where the mechanisms for preventing unauthorized use of a
group are documented.

a) pg_create() creates a new process group; its only member will be the calling process. The
group may be given a symbolic name (if none is desired, a null pointer should be passed for
the name). The system will not verify that the name is unique. It may also be assigned an
incarnation number; this is done by and used by the recovery manager (see RMGR(TK) for
details; this may be specified as 0 if the rmgr is not being used). The group will continue to
exist until deleted with pg_delete unless all its members fail, at which time it will be deleted
automatically. See BCAST(TK) for details concerning group addressing. A process can be a
member of an unlimited number of groups.

b) pg_lookup(name) looks for a group with the given symbolic name and returns its group id.
The search is done in all sites that are "local" and "long distance", but not those that are
"remote" relative to the caller (see ADDRESSING(TK) for definitions of local and remote).
If the name is not found, NULLADDRESS is returned. In the future, pg_lookup() will be
extended to support some form of pattern matching and a permission scheme under which it
will only be possible to look up a name if one has "permission" to access it. If several
groups match, the first address found is returned to the caller.

c) pg_addmemb() adds a new process to a preexisting group. It can only be done by a group
member; this is to allow the group members to validate new potential members. It fails,
returning an error code, if the group does not exist, the caller is not a member, or the
process is already a member. Since this call can only be done by a member, pg_join() is
provided as a convenient way to ask a group to let a potential member join.

d) pg_leave() deletes the caller from the group.

e) pg_migrate() simultaneously adds process "newpname" and deletes the caller process from
the group. The discussion of pg_addmemb applies. It fails if the group does not exist, the
caller is not a member, or newpname is not specified correctly. This routine has not yet been
implemented.

f) pg_delete() deletes the designated group even if it still has members. The caller must be a
member of the group. It fails if the group cannot be found or the caller is not a member.

g) pg_addclient() makes the group directly accessible by the designated client, but without
making the client a member of the group. This is necessary if the client is to use some of the
more sophisticated addressing modes identified in the BCAST(TK) routines. Should the
client fail, it will automatically be deleted from this list.

h) pg_delclient() deletes a direct access client.

i) pg_getview() returns the current membership of a group as a pgroup data structure, defined
in pr_group.h and automatically included by cl.h. The caller need not be a member and the
group need not be directly accessible. The view is not guaranteed to remain unchanged; it
is, however, guaranteed to have the same value when different recipients of a message all
call pg_getview() when the message first arrives (but without doing a t_wait() first). A null
pointer is returned if the group is not found.

j) pg_lockview() locks the current view against changes and returns the view, which the caller
can use to compute an action that depends on the current membership. The lock is
automatically released the next time a broadcast is done to the group by this process, or if the
process fails. It returns a null pointer if the group is not found. This routine is not yet
implemented.

k) pg_signal() sends a UNIX signal to the members of the designated groups and processes,
which are identified by a null-terminated address list. It fails if the caller is not a member of
the group or the group does not exist.

l) pg_monitor() monitors the designated group for membership changes. The caller must be a
member; the request fails if this is not the case. Should the membership change, the
callback routine is invoked as: routine(pg, arg); where pg is a pointer to a pgroup data structure
containing the new view and the argument is the one given in the monitor request.

m) pg_monitor_cancel() cancels a pg_monitor request. The arguments must match those for the
pg_monitor. It fails if the monitor request is unknown.

n) pg_join() is a toolkit routine that does an RPC to the designated group, passing the
designated message. The message pointer can be null if the group doesn't do verification. The
message is delivered to the join verification routine (see below). If this routine is undefined
or returns 0, the join is permitted. If it is defined and returns an error code (a negative
number), the join is aborted and pg_join() returns this error code. If the caller is added to
the group successfully, pg_join() returns 0. A process can be a member of an unlimited
number of groups.

o) pg_join_verifier() is a toolkit routine with which the members of a group can specify a
routine that will validate new join requests. The routine is later invoked as

routine(mp)
register message *mp;
{

}

where mp is the message sent in the pg_join() request. The id of the group being joined is
available in the system field SYSFLD_GID of this message. The join verifier is only called at one of
the members of the group, but the actual member that will be asked to do the verification may
vary. All members should therefore define this routine if any does and all should use the same
verification rule.

Except when otherwise indicated, all the routines return 0 in the event of normal termination and
-1 if an error occurs.
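
The join protocol described above (no verifier, or a verifier returning 0, permits the join; a negative return aborts it and is passed back to the caller) can be illustrated with a small in-memory model. Nothing here is ISIS code: struct toy_group, check_secret and toy_join are hypothetical, process ids are plain integers, the credential is a string rather than a message, and the real mechanism involves an RPC to one of the group members.

```c
#include <string.h>

#define MAX_MEMBERS 8   /* arbitrary capacity for this sketch */

/* Hypothetical in-memory group; real ISIS groups are distributed. */
struct toy_group {
    int members[MAX_MEMBERS];
    int nmembers;
    int (*verifier)(const char *credential);  /* may be NULL */
};

/* Example rule: accept only callers presenting the string "secret". */
int check_secret(const char *credential)
{
    if (credential != NULL && strcmp(credential, "secret") == 0)
        return 0;
    return -1;
}

/* Mirrors the documented outcome of a join request: 0 on success,
 * otherwise the negative code returned by the verifier (or -1 if the
 * caller is already a member or the toy group is full). */
int toy_join(struct toy_group *g, int pid, const char *credential)
{
    int i, rc;

    for (i = 0; i < g->nmembers; i++)
        if (g->members[i] == pid)
            return -1;                    /* already a member */
    if (g->verifier != NULL && (rc = g->verifier(credential)) < 0)
        return rc;                        /* join aborted by verifier */
    if (g->nmembers == MAX_MEMBERS)
        return -1;
    g->members[g->nmembers++] = pid;
    return 0;
}
```

Because any member may be asked to run the check, every member in a real group would install the same rule; in this single-process model there is only one function pointer, so that requirement does not show up.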


PROTECTION(TK)          DISTRIBUTED SYSTEMS TOOLKIT          PROTECTION(TK)


1. Synopsis

Some  thoughts  on  protection  in  ISIS. 

2.  Relevant  Interfaces 

pg_join(...)

pg_migrate(...)

pg_signal(...)

au_verify(...)

au_permit(...)

au_revoke_perm(...)

bcast(...)


3. Discussion

Because  ISIS  will  operate  in  large  networks  with  multiple  protection  domains  present,  protection 
within  ISIS  itself  poses  difficult  design  issues.  Two  basic  approaches  have  been  considered  in  this 
connection. 

The first approach is to encrypt capabilities (group and process id's) and use the encrypted
capabilities to mediate access to groups. This approach has been rejected because the size of the
encrypted capability would have to be very large, hence address lists (which are common in ISIS)
could get very big.

The alternative, which we implemented, treats authentication as an application-level problem, but
provides a reasonable degree of support for authenticating access. The approach is as follows.

First, members of a process group are given a chance to check the credentials of a process wishing
to join the group. See pg_join() for details. The idea is that the group members specify a
join_verifier routine and it checks the legality of the join request. ISIS provides the sender's
address in a secure form, but leaves it to the recipient processes - the group - to check the
sender's user identification. This is because many operating systems simply do not provide ISIS
with a mechanism for securely deducing any more information than this.

A reasonably secure mechanism, if security is your goal, would be to use public key encryption as
part of the authentication procedure. If a client wishes to join, it would be required to present
credentials to the group, obtaining these from a file that only it can access, and encrypting the
information to prevent eavesdroppers from learning anything useful.

We also support a mechanism for authentication of individual requests. See AUTHEN(TK) for
details. The interface is very similar: the group specifies a routine that authenticates individual
messages, although in this case (to avoid extra work), it is also possible to permit all messages
from a particular caller to be processed or to revoke a previously granted permission. The
mechanism uses the filter concept outlined in FILTER(TK).
1. Synopsis

A simple program demonstrating the use of the recovery manager routines.

2. Program Source

/******************************************************************

rdemo.c - a simple demonstration on how to use the rmgr routines

What this program does:

The program creates a process group named "recovery-demo" (or joins this
group if it exists), prints the message "rdemo: startup complete, waiting
for command messages", and (you guessed it) waits for 'command' messages
to arrive from some mysterious source. The purpose of this program is to
demonstrate how to use the recovery manager routines to have the program
restarted automatically and to create or join a process group after a crash.

How to run this program:

1. Start up isis on all sites.

2. Install an entry in the recovery manager restart database by typing:

       rmupdate 1 rdemo-program rdemo rdemo 1252

   This command creates an entry in the restart database for site 1, with
   key = "rdemo-program", program = rdemo, and argv = {rdemo 1252}.

   The rdemo program expects argv[1] to contain the internet port number
   for talking to isis. Replace 1252 by the correct number for that site. See
   INIT(TK), FILES(TK), RMUPDATE(TK) for details. You have to repeat
   this command for every isis site on which you want to run the demo.

3. Start the demo program at one of the sites (e.g. site 1) by typing

       rdemo 1252

4. Start the demo program at the other sites.

5. Now you can experiment to see what happens if you crash one of the
   programs (with cntrl-C or 'kill') or one of the sites (by killing
   the isis protocols process).

The program should be restarted automatically each time it is killed,
or after a site recovers from a crash. If you crash all sites and then
bring them up again, the process group will be recreated on the site
that died last. The rdemo programs at the other sites will wait until
rdemo is restarted on the 'last survivor' site.

To remove the rdemo entry from the restart database type
       rmupdate 1 rdemo-program
at site 1, and similar for the other sites.

******************************************************************/


#include <stdio.h>
#include "cl.h"
#include "cl_rmgr.h"

#define MSG_COMMAND (USER_BASE+0)   /* isis entry number for command message */
#define FLD_TEXT    1               /* message field for command text */

/******************************************************************
 * main_task - starts up the process group
 ******************************************************************/

void main_task()
{
    address gid;    /* process group address */

    /*
     * Register this process with the recovery manager.
     * This is only necessary if the program was not started by the recovery
     * manager (i.e. the first time, when it is started 'by hand'). However,
     * it does not hurt to always call rmgr_register().
     */
    rmgr_register("rdemo-program");

    /*
     * Restart the process group.
     * rmgr_restart() will call rmgr_getinfo() and based on that information
     * decide whether to just join the group, create the group, wait for another
     * site to create it, .... See RMGR(TK) for details.
     */
    gid = rmgr_restart("recovery-demo");
    if (gid.site == 0) {
        printf("rdemo: rmgr_restart failed\n");
        exit(-1);
    }

    /*
     * Start recovery manager view logging.
     * After this call the latest pgroup view is automatically
     * saved on stable storage.
     */
    rmgr_start_log(gid);

    printf("rdemo: startup complete, waiting for command messages\n");
}


/******************************************************************
 * msg_command - message handler for command messages
 ******************************************************************/

void msg_command(mp)
message *mp;
{
    char *text;

    text = msg_getfield(mp, FLD_TEXT, 1, (int *)0);
    if (strcmp(text, "quit") == 0) {
        /*
         * Call rmgr_unregister to tell the recovery manager that
         * the program wants to exit without being restarted.
         */
        rmgr_unregister();
        exit(0);
    } else if (strcmp(text, "crash") == 0) {
        /*
         * Commit suicide. The recovery manager will notice that the
         * program has died and will restart it automatically.
         */
        exit(0);
    } else if (strncmp(text, "echo", 4) == 0) {
        /*
         * Echo the message text to the screen.
         */
        printf("rdemo: %s\n", text+4);
    } else {
        printf("rdemo: unknown command: %s\n", text);
    }
}


/******************************************************************
 * main
 ******************************************************************/

main(argc, argv)
int argc;
char *argv[];
{
    int client_port;    /* port number for talking to isis */

    /*
     * get client_port from argv
     */
    if (argc != 2 || (client_port = atoi(argv[1])) == 0) {
        fprintf(stderr, "usage: %s port\n", argv[0]);
        exit(-1);
    }

    /*
     * set up isis stuff
     */
    isis_init(client_port);
    rmgr_init();
    isis_entry(MSG_COMMAND, msg_command, "msg_command");

    /*
     * fork off main_task and enter the main loop
     */
    t_fork_delayed(main_task, 0, 0);
    for (;;) {
        run_tasks();
        isis_read();
    }
}
1. Synopsis

A toolkit routine for managing replicated data.

#include <isis/cl.h>

In the service:

isis_init(0);

r_init(item_name, opener, sizer, reader, writer, fsyncer, default)
char *item_name;
int (*opener)(), (*sizer)(), (*reader)(), (*writer)();
int (*fsyncer)(), (*default)();

In the client:

isis_init(0);

fd = r_open(alist, item_name, how, fmode)
char *item_name;

r_lseek(fd, offset, mode)

nbytes = r_read(fd, buffer, len)
char *buffer;

nbytes = r_write(fd, buffer, type, len)
char *buffer;

nbytes = r_xwrite(fd, buffer, type, len)
char *buffer;

r_fsync(fd);

r_close(fd);


A simple interface to replicated data is presented. The interface largely emulates the existing
UNIX file system interface, although replicated data need not be stored on disk. Basically, the
managers of the replicated data item specify the routines that will read, write, etc. and the type of
broadcast to use by "default" when updating this data; in the r_xwrite() ("exclusive mode write")
case CBCAST is always used regardless of the default. Normally, the default routine will either
be CBCAST or ABCAST (see VSYNC(TK)). CBCAST would be used for structures that are
insensitive to the order in which updates are done to different data items; ABCAST when a
structure might behave differently for different update orders. The type field should be set as for a
msg_addfield (see MESSAGES(TK)).

In the service, the routines are invoked as follows:

opener(item_name, how, fmode)

The service should "open" the designated item_name; the arguments can be defined by the
service but are intended to mimic the arguments to the UNIX open system call. The opener
routine should return -1 on an error, setting the global variable "r_errno" to the error code,
and it should return 0 if the open succeeded. No state is saved on an open. The designer
should be aware that the file descriptor returned to the client will not be determined from
the code returned by the open routine. The client file descriptor is allocated from a
per-client table of open "files", corresponding to the various successful opens that have been
done. In fact, it is not required that the number returned by different representatives of the
service be the same. If some representatives return error codes, the request is assumed to
have failed and one of those codes is returned to the caller. If all return success, the request
succeeds. The designer of the system may assume that the sequence of r_open
requests is seen by all members of the service, but that the sequence of r_close invocations
may differ from representative to representative. The designer should arrange that the
representatives are left in equivalent states after the open call returns, which normally means
that all representatives should do the same thing on an open.

sizer(item)

The service should return the size of the file. The return code and error number are as
above.

reader(item, offset, buffer, len)

The service should do a read at the designated offset. The number of bytes read should be
returned, or -1 if an error occurred, with the error code stored in r_errno.

writer(item, offset, buffer, len)

The service should do a write at the designated offset. The number of bytes written should
be returned, or -1 if an error occurred, with the error code stored in r_errno.

fsyncer(item)

The service should do the equivalent of an fsync(), returning only when the outputs
previously done on the item have completed.

4. What about recovery from failures?

After a failure, a member of the recovery service may want to rejoin the service. It should do this
using the state transfer tool described in STATE_XFER(TK). To have a restart initiated
automatically, use the recovery tool described in RMGR(TK). In the case of a recovery from total
failure, the service should take the following action to recover its state: Save checkpoints of the
replicated data periodically in a file, using fsync to ensure that each has been fully flushed to disk.
It is best to write a temp file and then rename it, to avoid the risk that a crash will leave you with a
partially updated checkpoint. Alternatively, you could update a stable copy of the file on every write.
Using the recovery manager, you can determine whether to rejoin the group (if it survived you or
someone else recovered first) or to reload your checkpoint.
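
A minimal sketch of the temp-file-and-rename checkpoint technique, using only standard UNIX calls. The function name save_checkpoint and its parameters are illustrative and not part of the toolkit; the sketch assumes the temporary file lives on the same file system as the checkpoint, and it does not retry interrupted writes.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Illustrative helper (not an ISIS routine): write len bytes to tmp_path,
 * flush them to disk, then atomically install the file as path.
 * Returns 0 on success, -1 on error. */
int save_checkpoint(const char *path, const char *tmp_path,
                    const void *data, size_t len)
{
    const char *p = data;
    size_t done = 0;
    int fd;

    assert(path != NULL && tmp_path != NULL);
    fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n <= 0) {                   /* error (or no progress): give up */
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        done += (size_t)n;
    }
    if (fsync(fd) != 0) {               /* force the data to stable storage */
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;                           /* old checkpoint replaced in one step */
}
```

A crash before the rename leaves the previous checkpoint untouched; a crash after it leaves the new one, so a reader never sees a partially written file.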

5. Transactions on replicated files

You can use this tool in conjunction with the ISIS transaction facility to obtain transactions on
replicated files.

1. Synopsis

A  package  of  routines  for  remote  execution  of  a  program. 


2.  Interface 

#include <isis/cl.h>

isis_init(0);


r_exec(sites, prog, args, env, user, passwd, alist)
site_id *sites;
char *prog, **args, **env, *user, *passwd;
address *alist;

This toolkit routine is provided as an alternative to using the UNIX “rexec” facility. The specified
program is executed at each specified site. The file descriptors that are initially open (stdin,
stdout and stderr) point to the system console of the machine on which the rexec is done, so be
careful what you print. You should use the normal UNIX facility if you prefer these to point
to the console of the machine from which the program was run.

Unfortunately, the UNIX architecture makes it impractical to return a definite indication of
whether the exec actually succeeded. However, for the cases where it was apparently possible to
do the rexec and it was attempted, the alist will contain a null-terminated list of process addresses
corresponding to the processes that were created to do this. For cases where the rexec could not
be done but some detectable error occurred, the alist entry for the corresponding site/incarn will
have process-id number 0 and the entry field will be equal to the value of the UNIX errno
variable at that site. If a site/incarn is not operational, no alist entry will be made. The number of
entries in the alist is returned, or an error code if the CBCAST to the rexec processes failed for
some reason.
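
The way a caller might walk the returned alist can be sketched as follows. The real ISIS address type is not reproduced here: struct demo_addr is a hypothetical stand-in that models only the site, process-id and entry fields discussed above, and the sketch assumes that a site number of 0 marks the end of the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an ISIS address (assumption of this sketch). */
struct demo_addr {
    int site;       /* 0 terminates the list */
    int process;    /* process id; 0 means the exec could not be done */
    int entry;      /* holds the remote errno value when process == 0 */
};

struct rexec_summary {
    int started;        /* entries that name a created process */
    int failed;         /* entries reporting a detectable error */
    int first_errno;    /* errno from the first failed entry, else 0 */
};

/* Count started and failed entries in a null-terminated address list. */
struct rexec_summary summarize_alist(const struct demo_addr *alist)
{
    struct rexec_summary s = {0, 0, 0};

    assert(alist != NULL);
    for (; alist->site != 0; alist++) {
        if (alist->process != 0) {
            s.started++;
        } else {
            if (s.failed == 0)
                s.first_errno = alist->entry;
            s.failed++;
        }
    }
    return s;
}
```

Sites that were not operational contribute no entry at all, so comparing started + failed with the number of sites requested reveals how many sites were down.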

Rexec encrypts its message to prevent unauthorized access to the user name and password.
Remotely, it encrypts the password and then compares this with the version in the local password
file. This mechanism is a bit awkward, but it does provide at least a modicum of security. The
remote program will be executed in the home directory of the designated user.

4.  BUGS 

It is unfortunate that rexec() cannot indicate whether the exec actually succeeded. The user-id and
password are currently ignored. Everything runs under the isis account.

1. Synopsis

A toolkit routine for use in recovery from failures. This tool is still undergoing design. The
current version of the recovery manager consists of two independent parts: Automatic Process
Restarting, and Process Group Logging/Restarting.

2. Interface: Automatic Process Restart

#include <isis/cl.h>

rmgr_update(key, program, argv, envp);
char *key;
char *program;
char *argv[], *envp[];


isis_init(0);


rmgr_register(key);
char *key;

rmgr_unregister();

The recovery manager keeps a database of programs that it will restart after a site recovers from a
crash. The function rmgr_update atomically updates the restart database. Key is an arbitrary
string of up to RMLEN characters that uniquely identifies an entry in the restart database.
Program is a program name, and argv and envp are vectors of argument and environment strings as in
execve(2). Rmgr_update searches the database for an existing entry with the given key. If such
an entry is found, program, argv, and envp will be replaced by the new values; otherwise a new
entry will be created. Rmgr_update may be used to delete an existing entry by specifying
program as a null pointer. The utility program rmupdate (see RMUPDATE(TK)) provides a simple
user interface for rmgr_update.

The recovery manager keeps "watching" all processes that it has started up. Should any of them
abort, say due to a software error, it will automatically be restarted. A process that was not
started by the recovery manager may add itself to the list of processes being watched by calling
rmgr_register(key), where key must refer to an existing entry in the restart database. A process
that wants to exit without being restarted has to call rmgr_unregister before exiting.

All routines return a nonzero value in the case of an error. For rmgr_update the following two
error codes are defined:

RM_ELOCKED: The restart database is locked because another process is currently
updating it. The call should be retried.

RM_ENOTFOUND: Rmgr_update was called to delete an entry (program = NULL), but
no entry with the given key exists in the restart database.

Rmgr_update returns a negative value if it fails for any other reason.
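
Since a locked database is a transient condition, a caller will usually wrap the update in a bounded retry loop. The sketch below shows one way to do so without depending on the ISIS headers: DEMO_RM_ELOCKED is a placeholder for the real RM_ELOCKED (whose value is not given here), update_with_retry is a hypothetical helper, and demo_update merely stands in for a call to rmgr_update.

```c
#include <assert.h>
#include <unistd.h>

#define DEMO_RM_ELOCKED 1   /* placeholder; use RM_ELOCKED from the ISIS headers */

typedef int (*update_fn)(void *arg);

/* Call update(arg) until it stops reporting "locked" or max_tries is used up.
 * Returns the last code seen. */
int update_with_retry(update_fn update, void *arg, int max_tries,
                      unsigned delay_sec)
{
    int rc = DEMO_RM_ELOCKED;
    int i;

    assert(update != NULL);
    for (i = 0; i < max_tries; i++) {
        rc = update(arg);
        if (rc != DEMO_RM_ELOCKED)
            break;                      /* success, or an error not worth retrying */
        if (i + 1 < max_tries && delay_sec > 0)
            sleep(delay_sec);           /* give the other updater time to finish */
    }
    return rc;
}

/* Stand-in for rmgr_update(), for demonstration only: reports "locked"
 * until the counter pointed to by arg reaches zero, then succeeds. */
int demo_update(void *arg)
{
    int *locked_calls = arg;

    if (*locked_calls > 0) {
        (*locked_calls)--;
        return DEMO_RM_ELOCKED;
    }
    return 0;
}
```

In a real program the callback would pack the key, program, argv and envp into a structure and pass them on to rmgr_update.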

It is no error to call rmgr_register if the process is already registered. This may be used to change
the restart database entry associated with the process. Rmgr_register does not check whether the
given key exists in the restart database. The recovery manager will print a message on stderr if it
is unable to restart a process because it cannot find the entry in the restart database.


#include <isis/cl.h>
isis_init(0);
rmgr_init();


rmgr_start_log(gid);
address gid;

rmgr_stop_log(gid);
address gid;

rmgr_info *rmgr_getinfo(pgname, noblock);
char *pgname;
int noblock;


gid = rmgr_create(rmi)
rmgr_info *rmi;


gid = rmgr_join(rmi, mp)
rmgr_info *rmi;
message *mp;

gid = rmgr_restart(pgname)
char *pgname;


Defined in isis/cl_rmgr.h (included in isis/cl.h):

typedef struct {
int rm_mode;
pgroup rm_view;

} rmgr_info;

#define RM_LOG 0x01
#define RM_RECENT 0x02
#define RM_SURE 0x04

The recovery manager also assists in recreating a process group after all its members have
crashed. When restarting a process group after a total crash, it is desirable to find out which
process was the last one to fail. For this purpose a log of changes in the process group view is
kept on stable storage.

(At least) one member of a pgroup at each site should call rmgr_start_log (the process has to be a
member of the group). This call saves the current pgroup view in a file on disk, and arranges for
the file to be updated whenever the view changes. The function rmgr_stop_log disables automatic
updates to the view file.

When a program is restarted after a site recovers from a crash, the program can call rmgr_getinfo
to get information about the state of a process group before the crash. Rmgr_getinfo reads the
last pgroup view that was stored on disk into the field rm_view and sets some flags in rm_mode,
which are to be interpreted as follows:

1. RM_LOG not set: A pgroup view file does not exist or it is empty. Interpretation: This is
the first time that a process at this site becomes a member of the group.

2. RM_LOG, RM_SURE set, RM_RECENT not set: A view has been read from disk; however,
this view is not the most recent one among the views stored in view files at other sites.
Interpretation: A process (or processes) at this site was a member of the group when the
site (or just that process(es)) crashed. Other members of the group at other sites were still
alive after this crash.

3. RM_LOG, RM_RECENT, RM_SURE set: A view has been read from disk; no other site
has a more recent view recorded on disk. Interpretation: All members of the group have
crashed. This site was one of the last sites up before the crash occurred. Rm_view contains
the list of last survivors before the crash.

In order to set RM_RECENT correctly, rmgr_getinfo may need to access view files at other sites.
Therefore rmgr_getinfo might block, waiting for other sites to recover, before it can decide
whether the local view file contains the most recent pgroup view. In particular, if rmgr_getinfo
returns with all flags set (case 3), the programmer can assume that all sites mentioned in rm_view
have recovered from the crash. This fact may be used to start a coordinator-cohort style protocol
among those sites for application specific recovery.

If this behavior is not desired, rmgr_getinfo should be called with a non-zero value for the
parameter noblock. In this case rmgr_getinfo will not block, but it may return with RM_LOG and
RM_RECENT set and RM_SURE not set, indicating that none of the sites that are currently up
has a more recent view stored on disk.
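
The flag combinations described above can be decoded with a small helper such as the one sketched below. The three flag values are those listed in the interface section; the enum and the function classify_rm_mode are hypothetical names introduced for illustration. When building against the real ISIS headers, the #define lines should be dropped in favor of the toolkit's own definitions.

```c
#include <assert.h>

#define RM_LOG    0x01
#define RM_RECENT 0x02
#define RM_SURE   0x04

enum demo_rm_case {
    DEMO_FIRST_TIME,      /* case 1: no view logged at this site */
    DEMO_LOCAL_CRASH,     /* case 2: other sites hold a more recent view */
    DEMO_LAST_SURVIVOR,   /* case 3: total crash, this site was among the last up */
    DEMO_MAYBE_LAST,      /* noblock result: most recent among sites now up */
    DEMO_UNDESCRIBED      /* a combination the text does not cover */
};

/* Map the rm_mode flags filled in by rmgr_getinfo onto the cases above. */
enum demo_rm_case classify_rm_mode(int rm_mode)
{
    if (!(rm_mode & RM_LOG))
        return DEMO_FIRST_TIME;
    if (rm_mode & RM_RECENT)
        return (rm_mode & RM_SURE) ? DEMO_LAST_SURVIVOR : DEMO_MAYBE_LAST;
    if (rm_mode & RM_SURE)
        return DEMO_LOCAL_CRASH;
    return DEMO_UNDESCRIBED;
}
```

Only DEMO_LAST_SURVIVOR justifies the assumption that every site named in rm_view has already recovered.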

The interpretation given above is only valid if the recovery manager routines are used according
to the following rules:

1. A program should call rmgr_start_log as soon as it has joined the group and has completed
local initialization actions. If the program is restarted after a crash, rmgr_start_log should be
called after rejoining the group and performing local cleanup/recovery actions.

2. After a total pgroup crash one of the last survivors should create the group again by calling
pg_create(pgname, rm_view.pg_incarn+1). It is important that the incarnation number of
the new group is greater than the one recorded last on disk. Rmgr_start_log should be
called after global cleanup/recovery actions have been completed.

The routines rmgr_create and rmgr_join may be used to restart a process group after a total crash.
Rmgr_create creates a new incarnation of the process group (based on the rmgr_info obtained by
rmgr_getinfo) and announces the new gid on the news (see NEWS(TK)). This routine must be
called by one of the last survivors. Other recovering group members should call rmgr_join, which
will wait for the news announcement, and will then join the new group incarnation. Rmgr_join
checks to see if the group still/already exists before it blocks waiting for news announcements, so it
can also be used to rejoin a group after a local crash. The parameter 'mp' is passed to pg_join (see
PGROUP(TK)).

Rmgr_restart provides a simple, minimal interface to rmgr_getinfo, rmgr_create, and rmgr_join. It
assumes that there is only one member of a process group at each site. The code for rmgr_restart
is given below; it illustrates the use of rmgr_getinfo, rmgr_create, and rmgr_join:

address rmgr_restart(pgname)
char *pgname;
{
    rmgr_info *rmi;
    int create_flag;
    address gid;
    message *mp;

    /*
     * Find out whether to create or join the group
     */
    rmi = rmgr_getinfo(pgname, 0);

    if (!(rmi->rm_mode & RM_LOG)) {
        /*
         * No previous view logged at this site: assume this is the first time
         * the group is started up. Check if the group already exists. If not,
         * create the group.
         */
        gid = pg_lookup(pgname);
        create_flag = (gid.site == 0);
    } else if (rmi->rm_mode & RM_RECENT) {
        /*
         * This site has a copy of the most recent group view in its view log file:
         * assume that all group members have crashed and that this site was one
         * of the last survivors. If this site is the first one in the list of last
         * survivors then create the new group incarnation; otherwise wait for
         * somebody else to create the group, and then join it.
         */
        create_flag = (rmi->rm_view.pg_alist[0].site == my_site_no);
    } else {
        /*
         * The group view stored at this site is not the most recent one in the
         * system: assume that this is a recovery from a local crash. Simply
         * rejoin the group, but use rmgr_join in case the group has crashed
         * in between.
         */
        create_flag = 0;
    }

    /*** create or join group ***/
    if (create_flag) {
        return rmgr_create(rmi);
    } else {
        mp = msg_newmsg();
        gid = rmgr_join(rmi, mp);
        msg_delete(mp);
        return gid;
    }
}

7.  Diagnostics 

In case of an error rmgr_start_log and rmgr_stop_log return a nonzero value, rmgr_getinfo
returns a null pointer, and rmgr_create, rmgr_join, and rmgr_restart return NULLADDRESS.

8. Bugs

Entries in the restart database are stored in the following format:

"key" program {arg1, arg2, ..., argn} {env1, env2, ..., envn}

The rmgr will not work properly if the key contains quote characters, or if one of the argument or
environment parameters contains curly braces or commas.
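
Because of these restrictions it can be worth validating a key and its argument vectors before handing them to rmgr_update. The sketch below formats an entry in the layout shown above and rejects the characters that would confuse the recovery manager. The function names are hypothetical, and the exact spacing inside the real database file is an assumption taken from the format line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Append s to out (capacity cap, current length *len).  Returns 0 or -1. */
static int append(char *out, size_t cap, size_t *len, const char *s)
{
    size_t n = strlen(s);

    if (*len + n + 1 > cap)
        return -1;
    memcpy(out + *len, s, n + 1);
    *len += n;
    return 0;
}

/* Append "{a, b, c}" built from a NULL-terminated vector; reject any string
 * containing a curly brace or a comma. */
static int append_list(char *out, size_t cap, size_t *len,
                       const char *const *vec)
{
    size_t i;

    if (append(out, cap, len, "{") != 0)
        return -1;
    for (i = 0; vec != NULL && vec[i] != NULL; i++) {
        if (strpbrk(vec[i], "{},") != NULL)
            return -1;
        if (i > 0 && append(out, cap, len, ", ") != 0)
            return -1;
        if (append(out, cap, len, vec[i]) != 0)
            return -1;
    }
    return append(out, cap, len, "}");
}

/* Hypothetical helper: format one restart database entry into out.
 * Returns 0 on success, -1 if the entry is unsafe or does not fit. */
int format_restart_entry(char *out, size_t cap, const char *key,
                         const char *program, const char *const *argv,
                         const char *const *envp)
{
    size_t len = 0;

    assert(out != NULL && key != NULL && program != NULL);
    if (cap == 0 || strchr(key, '"') != NULL)
        return -1;
    out[0] = '\0';
    if (append(out, cap, &len, "\"") != 0 ||
        append(out, cap, &len, key) != 0 ||
        append(out, cap, &len, "\" ") != 0 ||
        append(out, cap, &len, program) != 0 ||
        append(out, cap, &len, " ") != 0 ||
        append_list(out, cap, &len, argv) != 0 ||
        append(out, cap, &len, " ") != 0 ||
        append_list(out, cap, &len, envp) != 0)
        return -1;
    return 0;
}
```

A caller would typically run this check first and call rmgr_update only when it returns 0.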

A much fancier recovery manager will be introduced eventually. The functions of this one will be
preserved, but perhaps not the interface it supports.

1. Name

rmupdate - update the restart database used by the recovery manager

2. Synopsis

rmupdate site_no key

rmupdate site_no [-E] key program arg0 arg1 ...


Rmupdate is a utility program for updating the restart database used by the recovery manager. It
calls rmgr_update (see RMGR(TK)) with the arguments supplied on the command line. Site_no
specifies on which site the database should be updated. It refers to the site-id of the site as it is
found in the sites file (see ADDRESSING(TK), site table). The parameters key, program, arg0,
arg1, ... are passed to rmgr_update without change. If the -E option is used, rmgr_update is
called with the environment from which rmupdate was started; otherwise rmgr_update is called
with an empty environment.


4. Examples

Assume that the restart database at site 2 does not yet contain an entry with the key "test1". The
command

rmupdate 2 test1 /isis/test/testprog testprog 1461 -v

creates the following entry in the database:

"test1" /isis/test/testprog {testprog, 1461, -v} {}

If later the command

rmupdate 2 -E test1 /isis/test/testprog testprog 1464

is issued, the entry will be replaced by something like

"test1" /isis/test/testprog {testprog, 1464}
{HOME=/isis/schmuck, PATH=.:/usr/local:/usr/bin, TERM=vt100, USER=schmuck}

Finally, the command

rmupdate 2 test1

deletes the entry from the restart database.

5. Bugs

It is currently not possible to specify environment values explicitly.

1. Synopsis

A  package  of  routines  implementing  distributed  semaphores. 

2.  Interface 

#include <isis/cl.h>

isis_init(0);


Pb(alist, sname, free_on_failure)
address *alist;
char *sname;
address free_on_failure;

Pg(alist, sname, free_on_failure)
address *alist;
char *sname;
int free_on_failure;

Vb(alist, sname)
address *alist;
char *sname;

Vg(alist, sname)
address *alist;
char *sname;

sema_xfer_out(addr, len)
char **addr;
int *len;

sema_xfer_in(addr, len)
char *addr;
int len;

sema_dump()


3. Description

The semaphore tool is used for synchronization in a process group setting. By employing it, a
process can obtain mutual exclusion with respect to some set of other processes that know of the
semaphores it is using. The argument "alist" is a null-terminated address list that identifies the
processes and process groups where the semaphore lives. All semaphore routines return 0 in the
normal case and -1 if the processes corresponding to the alist have all failed.

The tool provides both binary (true/false) and general (integer valued) semaphores. Each
semaphore is identified by a null-terminated character string, which need not be declared prior to
the first use. A general semaphore will block if the number of Vg() operations done since the
semaphore was first referenced is smaller than the number of Pg() operations done so far, including
the current one. A binary semaphore will block unless a Vb() was done subsequent to the last Pb().
That is, a general semaphore is initially 0 and a binary semaphore is initially false.

The semaphore scheme is a “fair” one: P() requests are satisfied in the order they are received.
The argument “free_on_failure” indicates how the semaphore subsystem should handle the failure
of a process which has done a P() and has not yet done a matching V(). If this argument is null
(actually, NULLADDRESS), the semaphore subsystem will not worry about failures. If the
argument is a group id, the semaphore system will watch the caller by monitoring that group, to which
the semaphore holders must also belong. If the holder fails, a V() of the appropriate type is
performed automatically. Semaphore users must either employ this mechanism or some mechanism
with equivalent functionality to avoid deadlocks when a semaphore holder fails. Use of an
alternate mechanism might be more appropriate if some cleanup actions must be taken on behalf of
the semaphore holder before the mutual exclusion it held can be released.

4.  Comment 

Semaphore  synchronization  is  compatible  with  all  tools  that  maintain  replicated  data. 

5.  State  transfer 

To generate a block containing the semaphore “state”, call sema_xfer_out; it will assign values to
addr and len as required by XFER(TK). The block can be read in using sema_xfer_in(). Only
one block is needed for the semaphore state; the length will depend on the number of semaphores
in active use.

6. Bugs and restrictions

To make use of the free on failure option, a semaphore operation must be applied to a process
group to which the caller belongs. This is because free on failure uses the watch tool, and the
watch tool currently only supports watching process groups to which one belongs. This restriction
will eventually be eliminated.

1. Synopsis

Startup sequence, for processes using the TASK mechanism. Non-TASK use of ISIS is not yet
supported, so you MUST use this interface.

2. Interface

#include "cl.h"

/* Entry code by which eat_msg() can be called */
#define EAT_MSG (USER_BASE+0)

main(argc, argv)
char **argv;
{
    int foreground(), eat_msg();

    /* Initialize connection to ISIS */
    isis_init(0);

    /* Initialize toolkits */
    ... toolkit init calls ...

    /* initialize entry points this client will support */
    isis_entry(EAT_MSG, eat_msg, "eat_msg");

    /* Fork off the foreground task, if any */
    t_fork_delayed(foreground, 0, 0);

    /* Main loop: run tasks and receive messages */
    forever
    {
        run_tasks();
        isis_read();    /* This blocks, but see below */
    }
}

/* This task is the "main" procedure of the program */
foreground()
{
    ... stuff ...
}

/* Routines to process received messages */
eat_msg(mp)
message *mp;
{
    ... stuff ...
}

/* Soft recovery from an ISIS system crash that left me running */
isis_failed()
{
    /* In fact, I prefer to print a message and die */
    return(-1);
}


The above program executes a typical startup sequence by initializing a connection to ISIS,
declaring that messages sent to entry number EAT_MSG will be handled by a procedure called
eat_msg(), and then spawning a foreground task that acts as the "main procedure" for the
program. The actual main procedure then loops running tasks and reading messages; it may also
want to do non-blocking IO on other I/O channels. Messages from the ISIS protocols process and
other remote processes are received over the file descriptor called isis_socket (a global integer).
The routine isis_read() will read a message over this socket, blocking until one is received. It then
spawns a task to deal with the arriving message.

The reader should refer to INIT(TK) for information about the mysterious argument to isis_init().

To do non-blocking IO from ISIS it would be best to change the main loop to do a select. For
example, the following code either reads from isis or from the file descriptor “spcl_fdes”,
depending on which one has data available.

#include <sys/types.h>
#include <sys/time.h>

forever
{
    fd_set in_mask;
    extern isis_socket;

    run_tasks();

    FD_ZERO(&in_mask);
    FD_SET(isis_socket, &in_mask);
    FD_SET(spcl_fdes, &in_mask);

    /* Block until input is available */
    select(32, &in_mask, (fd_set*)0, (fd_set*)0, (struct timeval*)0);

    /* Read from ISIS and create associated task to run later */
    if(FD_ISSET(isis_socket, &in_mask))
        isis_read();

    /* Read from special file descriptor */
    if(FD_ISSET(spcl_fdes, &in_mask))
        spcl_read();
}

The above code works as follows. Within a single address space, the foreground task and any
active message-processing tasks will co-exist, switching off using a coroutine mechanism styled
after monitors, as described in TASKS(TK). In this particular case, arriving messages with entry
number EAT_MSG will be passed to the eat_msg() routine, which should process the message and
reply if necessary. Meanwhile, if data becomes available on spcl_fdes the routine spcl_read will be
called; it should either do the read immediately or fork off a task to do it when run_tasks is next
run (in fact, one might simply fork off the task directly, as in t_fork_delayed(spcl_read, spcl_fdes,
0)).

Unless some task explicitly calls “exit”, this program will run until an error communicating with
ISIS occurs or the site fails. In particular, if the foreground procedure returns, the program just
becomes a passive service responding only to new requests. If the foreground procedure remains
active, blocking periodically or doing calls to other processes, messages will be received while it is
blocked. If it enters an infinite computational loop, it will not be interrupted. In addition, if it
reads from a blocking IO device like the keyboard, messages will NOT be accepted until the IO
terminates.
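
The select-based wait at the heart of such a main loop can be isolated in a small helper. The sketch below is plain POSIX code with no ISIS dependency; wait_readable and the READY_ constants are illustrative names, and in an ISIS program the two descriptors would be isis_socket and the application's own descriptor.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define READY_A 0x1
#define READY_B 0x2

/* Block until fd_a or fd_b is readable, or until the timeout expires
 * (a null timeout blocks indefinitely).  Returns a mask of READY_A and
 * READY_B, 0 on timeout, or -1 on error. */
int wait_readable(int fd_a, int fd_b, struct timeval *timeout)
{
    fd_set in_mask;
    int maxfd = fd_a > fd_b ? fd_a : fd_b;
    int ready = 0;

    assert(fd_a >= 0 && fd_b >= 0);
    FD_ZERO(&in_mask);
    FD_SET(fd_a, &in_mask);
    FD_SET(fd_b, &in_mask);

    /* The first argument is one more than the highest descriptor tested. */
    if (select(maxfd + 1, &in_mask, NULL, NULL, timeout) < 0)
        return -1;

    if (FD_ISSET(fd_a, &in_mask))
        ready |= READY_A;
    if (FD_ISSET(fd_b, &in_mask))
        ready |= READY_B;
    return ready;
}
```

Computing the first argument of select from the descriptors, instead of hard-coding a constant, keeps the loop correct if a descriptor number exceeds the constant.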

A more useful foreground procedure would be the following: it creates a group:

foreground()
{
    gid = pg_create("group_name");
    if(gid.site == 0)
        panic("create of group <group_name> failed");
}

A more complex mechanism might start multiple group members up using the isis remote exec
facility (REXEC(TK)) and verify that each member is allowed to join the group. An example of
this is given below.

Here is a second example: a process that wishes to act as a client to the example group defined
above.

foreground()
{
    /* Example of process group creation and a broadcast */
    static address addrs[2];
    static char answ[8];
    message *mp;

    addrs[0] = pg_lookup("group_name");
    if(addrs[0].site == 0)
        panic("pg_lookup <group_name> failed");
    mp = msg_newmsg();
    addrs[0].entry = EAT_MSG;

    if(CBCAST(addrs, mp, ALL, answ, 8, (address*)0) != 1)
        panic("Got unexpected number of replies from CBCAST");
    printf("After CBCAST: received <%s>\n", answ);
    exit(0);
}

It should be noted that this client doesn't obtain “direct access” to the group. To give it direct
access it would be sufficient for the group member to call pg_addclient(sender) in the eat_msg()
routine. The termination of the client would automatically trigger a pg_delclient().

The code below consists of a twenty-questions program and a question/answer program that acts
as its client. The twenty questions program assumes that its remote representatives will start
themselves up. More realistic would be to use the 'r_exec' facility for this purpose, but this would
make the example a bit too complex.

Both programs reference the following include file:

#define TWENTY_QUERY (USER_BASE+0)
#define TWENTY_DB_INIT (USER_BASE+1)

#define TWENTY_CAT 0
#define TWENTY_CLASS 1
#define TWENTY_QUES 2
#define TWENTY_DB 3

The question database program is as follows:

/*
 * A twenty questions program
 */
#include "cl.h"
#include "twenty.h"

int main_proc(), join_proc(), twenty_db_init(), twenty_query();
address gid, atoaddr(), pg_lookup();

int main_prog;
int verbose;
int my_number;
int CLIENT_PORT;

main(argc, argv)
char **argv;
{
    while(argc-- > 1)
        switch(**++argv)
        {
        default:
        badarg:
            panic("Bad argument: <%s>\n", *argv);

        case '-':
            switch((*argv)[1])
            {
            case 'm': ++main_prog; continue;
            case 'v': ++verbose; continue;
            default: goto badarg;
            }

        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
            CLIENT_PORT = atoi(*argv);
            continue;
        }

    /* Connect to ISIS, then fork off appropriate procedure */
    isis_init(CLIENT_PORT);
    pg_init();

    isis_entry(TWENTY_DB_INIT, twenty_db_init, "twenty_db_init");
    isis_entry(TWENTY_QUERY, twenty_query, "twenty_query");

    if(main_prog)
        t_fork_delayed(main_proc, 0, 0);
    else
        t_fork_delayed(join_proc, 0, 0);

    /* Now enter ISIS main loop */
    forever
    {
        run_tasks();
        isis_read();
    }
}

condition mcount;
int nmembers;

/* Monitor changes to view */
tmon(gid, pg, arg)
address gid;
pgroup *pg;
char *arg;
{
    if(pg->pg_nmemb == nmembers)
    {
        t_sig_delayed(&mcount, 0);
        pg_monitor_cancel(gid, tmon, 0);
    }
}

#define NMEMBERS 5
#define NCAT 10
#define NLINES 200
#define NFIELDS 10
#define STRLEN 10

char db[NLINES][NFIELDS][STRLEN];
char *cnames[NCAT];
int nfields, nlines, ncat;

/* Startup of the main program */
main_proc()
{
    register FILE *file;
    char answ[NMEMBERS];
    register c;

    if((file = fopen("questions.dat", "r")) == 0)
    {
        perror("questions.dat");
        panic("can't read the questions database");
    }

    /* First line: tab-separated field headings */
    while((c = fgetc(file)) > 0 && c != '\n')
    {
        register char *fp = db[0][nfields++];
        do
            *fp++ = c;
        while((c = fgetc(file)) > 0 && c != '\t' && c != '\n');
        *fp = 0;
        if(c == '\n')
            break;
    }




    /* Remaining lines: one database row per line */
    nlines = 1;
    while((c = fgetc(file)) > 0)
    {
        register n;

        for(n = 0; n < nfields; n++)
        {
            register char *sp = db[nlines][n];
            while(c != '\n' && c != '\t' && c > 0)
            {
                *sp++ = c;
                c = fgetc(file);
            }
            *sp = 0;
            if(c == '\t')
                c = fgetc(file);    /* skip the field separator */
        }
        ++nlines;
    }

    gid = pg_create("twenty_questions");
    if(gid.site == 0)
        panic("can't create the process group!");
    nmembers = NMEMBERS;
    pg_monitor(gid, tmon, 0);
    t_wait(&mcount);

    tw_init();

    begin
    {
        address addrs[2];
        register message *mp;
        register nrep;

        printf("[%d]: %d members, %d fields, %d lines in db, ncat %d\n",
            my_number, NMEMBERS, nfields, nlines, ncat);
        addrs[0] = gid;
        addrs[0].entry = TWENTY_DB_INIT;
        addrs[1] = NULLADDRESS;
        mp = msg_genmsg(TWENTY_DB, db, FTYPE_CHARS, nlines * STRLEN * NFIELDS);
        nrep = CBCAST_EX(addrs, mp, ALL, answ, 1, (address*)0);
        msg_delete(mp);
        printf("%d members acknowledged initialization\n", nrep);
    }
}

/* Startup of a sub-program */
join_proc()
{
    register message *mp = msg_newmsg();

    gid = pg_lookup("twenty_questions");
    if(gid.site == 0)
        panic("pg_lookup failed");
    if(pg_join(gid, mp) != 0)
        panic("pg_join failed");
    msg_delete(mp);
}

/* sub-program reception of a database */
twenty_db_init(mp)
register message *mp;
{
    int dblen;
    char *dbinit;

    dbinit = msg_getfield(mp, TWENTY_DB, 1, &dblen);
    bcopy(dbinit, db, dblen);
    for(nfields = 0; db[0][nfields][0]; nfields++)
        continue;
    for(nlines = 1; db[nlines][0][0]; nlines++)
        continue;
    tw_init();
    printf("[%d]: %d fields, %d lines in db, ncat %d\n",
        my_number, nfields, nlines, ncat);
    reply(mp, "+", FTYPE_CHARS, 1);
}

*  Compute  various  stuff  from  db  and  from  view: 

*  nfields  -  fields  in  db 

*  nlines  =  length  in  lines  of  db 

*  neat  =  number  of  query  categories 

*  my_number  =  internal  ‘id’  of  this  process:  0..NMEMBERS-1 

*  In  TT  query  mode,  process  my_number*m  is  responsible  for  hues  1  s.t.  1  mod  m  =  0 

*  In  ’V’  query  mode,  das  process  is  responsible  for  row  r  s.t.  r  mod  m  *  0. 

*  Program  will  not  function  at  all  with  fewer  than  NMEMBERS  instances  running. 

•/ 

tw_init()
{
    register n, c;
    register pgroup *pg = pg_getview(gid);

    if(pg == 0)
        panic("tw_init");
    ncat = 1;
    c = 1;
    for(n = 1; n < nlines; n++)
        if(strcmp(db[n][0], db[c][0]))
        {
            cnames[ncat++] = db[n][0];
            c = n;
        }
    for(n = 0; n < pg->pg_nmemb; n++)
        if(cmp_address(&pg->pg_alist[n], &my_address) == 0)
        {
            my_number = n;
            break;
        }
}

twenty_query(mp)
register message *mp;
{
    register cat, class, f, n;
    register char *query, *heading;

    cat = *(int*)msg_getfield(mp, TWENTY_CAT, 1, (int*)0);
    cat %= ncat;
    class = *(int*)msg_getfield(mp, TWENTY_CLASS, 1, (int*)0);
    query = msg_getfield(mp, TWENTY_QUERY, 1, (int*)0);

    /* Split "heading=value" at the '=' */
    heading = query;
    while(*query != '=')
        ++query;
    *query++ = 0;
    for(f = 0; f < nfields; f++)
        if(strcmp(db[0][f], heading) == 0)
            break;

    /* In H mode, everyone answers. In V mode, only one answers */
    switch(class)
    {
        char *answ;
        int count;

    case 'H':
        answ = 0;
        count = 0;
        if(f == nfields)
            answ = "?";
        else for(n = 1; n < nlines; n++)
            if(strcmp(db[n][0], cnames[cat]))
                continue;
            else if(count++ % NMEMBERS == my_number)
                if(strcmp(db[n][f], query) == 0)
                    answ = (answ && *answ != 'Y')? "?": "Y";
                else
                    answ = (answ && *answ != 'N')? "?": "N";
        reply(mp, answ, FTYPE_CHARS, 1);
        break;

    case 'V':
        if(f % NMEMBERS != my_number)
            break;
        answ = 0;   /* start with no answer, as in the H case */
        if(f == nfields)
            answ = "?";
        else for(n = 1; n < nlines; n++)
            if(strcmp(db[n][0], cnames[cat]))
                continue;
            else if(strcmp(db[n][f], query) == 0)
                answ = (answ && *answ != 'Y')? "?": "Y";
            else
                answ = (answ && *answ != 'N')? "?": "N";
        reply(mp, answ, FTYPE_CHARS, 1);
        break;

    default:
        /* The reply string is missing from the source listing; "*" is assumed */
        reply(mp, "*", FTYPE_CHARS, 1);
        break;
    }
}
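
The rule that both query modes use to fold per-row results into a single answer can be isolated as a small helper. The following is only a sketch of that rule in present-day C; merge_answer and merge_all are hypothetical names that do not appear in the listing, and the single characters stand for the one-character strings "Y", "N" and "?" used above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the original listing): fold one more
 * row result into the running answer.
 *   prev  - answer so far: 0 (no rows seen yet), 'Y', 'N' or '?'
 *   match - nonzero if the current row matched the query
 * Returns 'Y' or 'N' while all rows agree, '?' once they disagree.
 */
char merge_answer(char prev, int match)
{
    char now = match ? 'Y' : 'N';

    assert(prev == 0 || prev == 'Y' || prev == 'N' || prev == '?');
    if (prev != 0 && prev != now)
        return '?';
    return now;
}

/* Fold a whole series of row results; returns 0 if there were no rows. */
char merge_all(const int *matches, size_t n)
{
    char answ = 0;
    size_t i;

    for (i = 0; i < n; i++)
        answ = merge_answer(answ, matches[i]);
    return answ;
}
```

Once the running answer has become '?', it stays '?' regardless of later rows, which matches the behavior of the conditional expressions in twenty_query.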

Here is the question-answer program that the user sees:

/*
 * Front end program for playing twenty questions
 */

#include "cl.h"
#include "twenty.h"

int verbose;
int CLIENT_PORT;

main(argc, argv)
char **argv;
{
    int ask_questions();

    while(argc-- > 1)
        switch(**++argv)
        {
        case '-':
            switch(*++*argv)
            {
            case 'v': ++verbose; continue;
            default: printf("-%c: unknown option\n", **argv); continue;
            }

        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
            CLIENT_PORT = atoi(*argv);
            continue;
        }

    /* Connect to ISIS */
    isis_init(CLIENT_PORT);
    pg_init();

    /* Runs as a task */
    t_fork_delayed(ask_questions, 0);

    forever
    {
        run_tasks();
        isis_read();
    }
}

ask_questions()
{
    int cat, class;
    char string[120];
    register char *sp;
    register c;
    address addrs[2];

    addrs[0] = pg_lookup("twenty_questions");
    if(addrs[0].site == 0)
        panic("twenty-questions asker - can't connect to database program");
    addrs[0].entry = TWENTY_QUERY;
    addrs[1] = NULLADDRESS;
    printf("Welcome to... twenty questions!\n");
    print("Enter a random number: ");
    sp = string;
    while((c = getchar()) != '\n')
        *sp++ = c;
    *sp = 0;
    cat = atoi(string);
    printf("Enter H:query or V:query...\n");

    forever
    {
        print("Question? ");
        c = getchar();
        if(c <= 0)
            break;
        class = c;

        /* Skip to the ':' that follows the query class */
        while((c = getchar()) != '\n')
            if(c == ':')
                break;

        /* Collect the query itself, dropping blanks and tabs */
        sp = string;
        while((c = getchar()) != '\n')
            if(c != ' ' && c != '\t')
                *sp++ = c;
        *sp = 0;

        if((class != 'H' && class != 'V') || strlen(string) == 0)
            printf("Enter H:cat=value or V:cat=value...\n");
        else
        {
            register message *mp;
            register nwant, nrep;
            char answ[20];

            /* Only one member answers a V query; everyone answers an H query */
            nwant = (class == 'V')? 1: ALL;
            mp = msg_genmsg(TWENTY_CAT, &cat, FTYPE_LONG, sizeof(int),
                TWENTY_CLASS, &class, FTYPE_LONG, sizeof(int),
                TWENTY_QUES, string, FTYPE_CHARS, strlen(string)+1,
                0);
            nrep = CBCAST(addrs, mp, nwant, answ, 1, (address*)0);
            answ[nrep] = 0;
            printf("\t%s\n", answ);
        }
    }
    printf("Bye.\n");
    exit(0);
}

3.1. Getting fancy

The above twenty questions program is not really very fancy: it doesn't restart itself automatically.
Here is a much improved version that automatically starts up NMEMBER+NSTANDBY
copies of itself and brings up a new standby after each failure. A standby takes over as a member
instantly, so the number of members in this example should never drop below NMEMBER. (If
it does, however, the twenty questions program shown below would abort itself and qa would get
0 responses to all its queries. A better solution to this is proposed below, but it involves changing
qa too.)

We made a slight change to the qa/twenty-questions interface in this version: it returns a two-byte
answer to queries indicating "who" gave the answer (a number 0..NMEMBER-1) and what
their answer was. The idea is that even though the task assignment may vary, a caller would
always get exactly one answer from each virtual member.
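
As an illustration of how a caller might use the two-byte replies, the sketch below sorts a batch of replies into one slot per virtual member. It is not taken from the qa program: collect_answers is a hypothetical helper, and it assumes the replies have already been gathered back to back into one buffer, two bytes per reply.

```c
#include <assert.h>
#include <string.h>

#define NMEMBER 5   /* same value as in the fancier program's #define */

/*
 * Hypothetical client-side helper: given nrep two-byte replies packed
 * back to back (byte 0 = virtual member number, byte 1 = answer), store
 * each answer in out[member].  Slots with no reply are left as ' '.
 * Returns the number of members heard from, or -1 if a reply carries a
 * member number outside 0..NMEMBER-1 or a member replies twice.
 */
int collect_answers(const char *replies, int nrep, char out[NMEMBER + 1])
{
    int i, seen = 0;

    assert(replies != 0 && nrep >= 0);
    memset(out, ' ', NMEMBER);
    out[NMEMBER] = '\0';
    for (i = 0; i < nrep; i++) {
        int who = replies[2 * i];
        char ans = replies[2 * i + 1];

        if (who < 0 || who >= NMEMBER || out[who] != ' ')
            return -1;
        out[who] = ans;
        seen++;
    }
    return seen;
}
```

Because each reply names its sender, the caller can line answers up by virtual member even when replies arrive in a different order from one query to the next.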

/*
 * A fancier twenty questions program
 */

#include "cl.h"
#include "twenty.h"

int main_proc(), join_proc(), twenty_query(), hello();
int start(), next_line(), restart_xfer();
address gid, atoaddr(), pg_lookup();

int must_join;      /* Flag: this process must join */
int my_number;      /* Virtual member number, see below */
int CLIENT_PORT;

#define NMEMBER 5   /* Wants this many members */
#define NSTANDBY 2  /* This many hot standbys */

char db[NLINES][NFIELDS][STRLEN];
char cnames[NCAT][STRLEN];
int nfields, nlines, ncat;


main(argc, argv)
char **argv;
{
    while(argc-- > 1)
        switch(**++argv)
        {
        default:
        badarg:
            panic("Bad argument: %s", *argv);

        case '-':
            switch((*argv)[1])
            {
            case 'f': ++must_join; continue;
            default: goto badarg;
            }

        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
            CLIENT_PORT = atoi(*argv);
            continue;
        }

    /* Connect to ISIS, then fork off appropriate procedure */
    isis_init(CLIENT_PORT);
    allow_xfers(start, next_line, restart_xfer);
    isis_entry(TWENTY_QUERY, twenty_query, "twenty_query");

    isis_entry(TWENTY_HELLO, hello, "hello");


    if(must_join == 0)
        t_fork_delayed(main_proc, (char*)0, (message*)0);
    else
        t_fork_delayed(join_proc, (char*)0, (message*)0);

    /* Now enter ISIS main loop */
    forever
    {
        run_tasks();
        isis_read();
    }
}


static pgroup cur_pgview;

/*
 * Monitor changes to view... all members see
 * the same view, so the coordinator can be selected
 * as the first (= oldest) listed member. The coordinator
 * does restarts as needed. The first view is passed
 * in manually after pg_create() but treated just like
 * any other. Only starts one process (if any) per
 * invocation, but since each start will change the view,
 * keeps doing this until enough members are running.
 */
tmon(pg)
pgroup *pg;
{
    address gid;

    gid = pg->pg_gid;
    cur_pgview = *pg;

    /* Repartition the database based on new view */
    work_partition(pg);

    /* Coordinator is the oldest member of the group */
    if(cmp_address(pg->pg_alist, &my_address) == 0)
    {
        if(pg->pg_nmemb < NMEMBER+NSTANDBY)
            start_one();
    }
}


#define TWENTY "/fs/moose/b/isis/client/twenty"

char *jargs[] =
{
    "twenty", "-f", 0, 0
};

/*
 * Start new program. If a failure takes place during this call, it
 * either completes first and the member is seen to join before the
 * failure is seen, or the failure is seen first but the join won't
 * occur - a nifty use of virtual synchrony to avoid a complicated
 * mess of figuring out if a restart was in progress and how it
 * terminated!
 */
start_one()
{
    static sno;
    static site_id sid[2];
    address pname;
    register site_id *sp;
    register nsites;
    sview *v, *sv_getview();
    char client[30];

    /*
     * Pick a site to start the thing at, try to distribute processes
     * over sites in a reasonably uniform manner so all won't run at
     * the same place. v->sv_slist[sno] is the site we settled on.
     */
    v = sv_getview();
    for(sp = v->sv_slist; *sp; sp++)
        continue;
    nsites = sp-v->sv_slist;
    if(sno >= nsites)
        sno = 0;
    *sid = v->sv_slist[sno];

    sprintf(client, "%d", CLIENT_PORT+3*sno);
    jargs[2] = client;

    r_exec(sid, TWENTY, jargs, (char**)0, "isis", "nullpass", &pname);
    if(pname.site == 0)
        panic("Can't rexec 'twenty' at site %d/%d\n",
            SITE_NO(*sid), SITE_INCARN(*sid));
}
/* In case of an interrupted state transfer, restart where it left off */
restart_xfer(bno)
{
    return(bno);
}

/* Startup of the main program */
main_proc()
{
    pgroup *pg_getview();
    register FILE *file;
    char answ[NMEMBER];
    register c, n;

    if((file = fopen("questions.dat", "r")) == 0)
    {
        perror("questions.dat");
        panic("can't read the questions database");
    }

    /* First line of the file: tab-separated field headings */
    do
    {
        register char *fp = db[0][nfields++];
        while((c = fgetc(file)) > 0 && c != '\n' && c != '\t')
            *fp++ = c;
        *fp = 0;
    }
    while(c != '\n' && c > 0);

    /* Remaining lines: one database row each */
    nlines = 1;
    do
    {
        for(n = 0; n < nfields; n++)
        {
            register char *sp = db[nlines][n];
            while((c = fgetc(file)) != '\n' && c != '\t' && c > 0)
                *sp++ = c;
            *sp = 0;
        }
        if(*db[nlines][0])
            ++nlines;
    }
    while(c > 0);

    /* Collect the distinct category names from column 0 */
    ncat = 0;
    c = 0;
    for(n = 1; n < nlines; n++)
        if(strcmp(db[n][0], db[c][0]))
        {
            strcpy(cnames[ncat++], db[n][0]);
            c = n;
        }

    /* Now start things by creating the group... */
    gid = pg_create("twenty_questions", 0);
    if(gid.site == 0)
        panic("can't create the process group!");

    /* Set up to monitor changes */
    pg_monitor(gid, tmon, (char*)0);

    /* First view won't get sent to pg_monitor, so send it manually */
    tmon(pg_getview(gid));
}

/*
 * This sets up to start a state transfer; all current members participate.
 * Since all see the current pgroup view (copied to the side in the
 * tmon routine), just copy the site list from the view into the alist
 * provided; all do this in parallel and all see the same view, so all
 * use the same alist. This is the simplest way to generate the alist.
 * We could also have copied msg_getdests(mp). There is no obvious reason
 * to favor one as opposed to the other here. (The dests field will have
 * been expanded by now, of course.)
 */
start(mp, who, gid, ap)
register message *mp;
register address *ap;
address who, gid;
{
    address *pg = cur_pgview.pg_alist;

    do
        *ap = *pg++;
    while(ap++->site);
}

/*
 * Send one line at a time, which is pretty inefficient (too small), but for
 * purposes of the demo illustrates a multi-block transfer. Actually, should
 * send the whole db at once, since it is really not very large.
 */
next_line(line, buffer, type, len)
char **buffer;
int *type, *len;
{
    if(line >= nlines)
        return(-1);
    *buffer = db[line][0];
    *type = 0;
    *len = nfields*STRLEN;
    return(0);
}

/* Get a line, sent above */
gotline(line, buffer, len)
char *buffer;
{
    bcopy(buffer, db[line], len);
}

hello(mp)
register message *mp;
{
    if(my_number)
        return;
    reply(mp, db, FTYPE_CHARS, NFIELDS*STRLEN);
}

/* Startup of a sub-program */
join_proc()
{
    register message *mp = msg_newmsg();
    register rv, c, n;
    int gotline();

    gid = pg_lookup("twenty_questions");
    if(gid.site == 0)
        panic("pg_lookup failed");
    if((rv = pg_join_and_xfer(gid, mp, gotline, X_BIG)) != 0)
        panic("pg_join failed: rv %d", rv);
    msg_delete(mp);

    for(nfields = 0; db[0][nfields][0]; nfields++)
        continue;
    for(nlines = 1; db[nlines][0][0]; nlines++)
        continue;
    ncat = 0;
    c = 0;
    for(n = 1; n < nlines; n++)
        if(strcmp(db[n][0], db[c][0]))
        {
            strcpy(cnames[ncat++], db[n][0]);
            c = n;
        }

    /*
     * same trick as above, although this process is unlikely to be the
     * coordinator yet.
     */
    pg_monitor(gid, tmon, (char*)0);
    tmon(pg_getview(gid));
}

/*
 * Each time the group view changes, divide up the work.
 * Crash the program if the number of members drops too low
 * (shouldn't happen).
 * The idea is to have each process know a "virtual" number
 * that defines its responsibility for some share of the work:
 * if my_number is i, this process handles 'V' mode queries for
 * columns that, mod NMEMBER, have index i, and H mode queries for
 * rows that, mod NMEMBER, have index i.
 */
work_partition(pg)
register pgroup *pg;
{
    register address *ap;
    static was_up;

    if(was_up && pg->pg_nmemb < NMEMBER)
        panic("Can't tolerate more than %d simultaneous failures!", NSTANDBY);
    else if(pg->pg_nmemb >= NMEMBER)
        ++was_up;

    for(ap = pg->pg_alist; ap->site; ap++)
        if(cmp_address(ap, &my_address) == 0)
        {
            my_number = ap-pg->pg_alist;
            if(my_number >= NMEMBER)
                /* Standby's get negative numbers */
                my_number = NMEMBER-my_number-1;
            return;
        }
    panic("work_partition - I'm not in the alist (never happens)");
}
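
The numbering rule in work_partition can be restated as a pure function of a process's position in the member list. This is only a sketch for illustration; virtual_number and is_active are hypothetical names, and the member count is passed as a parameter rather than taken from the NMEMBER definition.

```c
#include <assert.h>

/*
 * Hypothetical restatement of the numbering rule in work_partition():
 * index is this process's position in the (oldest-first) member list.
 * The first nmember positions are active members numbered 0..nmember-1;
 * later positions are hot standbys numbered -1, -2, ...
 */
int virtual_number(int index, int nmember)
{
    assert(index >= 0 && nmember > 0);
    if (index >= nmember)
        return nmember - index - 1;
    return index;
}

/* Nonzero if a process with this virtual number should answer queries. */
int is_active(int vnum)
{
    return vnum >= 0;
}
```

With five members, the process at position 5 is the first standby and gets number -1. If a member ahead of it in the list fails, it moves up to position 4 in the next view and becomes virtual member 4, which is how a standby takes over without any extra messages.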


twenty_query(mp)
register message *mp;
{
    register cat, class, f, n, comp;
    register char *query, *heading;

    query = msg_getfield(mp, TWENTY_QUES, 1, (int*)0);
    if(query == 0 || ncat == 0)
    {
        print("BAD ");
        pmsg(mp);
        /* The reply character is illegible in the source listing; '?' is assumed */
        snd_reply(mp, '?');
        return;
    }

    cat = *(int*)msg_getfield(mp, TWENTY_CAT, 1, (int*)0);
    cat %= ncat;
    class = *(int*)msg_getfield(mp, TWENTY_CLASS, 1, (int*)0);

    /* Split "heading<op>value" at the comparison operator */
    heading = query;
    while(*query != '=' && *query != '>' && *query != '<' && *query)
        ++query;
    comp = *query;
    *query++ = 0;
    for(f = 0; f < nfields; f++)
        if(strcmp(db[0][f], heading) == 0)
            break;

    /* In H mode, everyone answers. In V mode, only one answers */
    switch(class)
    {
        char *answ;
        int count;

    case 'H':
        if(my_number < 0)
        {
            /* Hot standby's don't send responses */
            reply(mp, (char*)0, 0, 0);
            break;
        }
        answ = 0;
        count = 0;
        if(f == nfields)
            answ = "?";
        else for(n = 1; n < nlines; n++)
        {
            if(strcmp(db[n][0], cnames[cat]))
                continue;
            else if(count++ % NMEMBER == my_number)
                if(compare(comp, db[n][f], query) == 0)
                    answ = (answ && *answ != 'Y')? "?": "Y";
                else
                    answ = (answ && *answ != 'N')? "?": "N";
        }
        if(answ == 0)
            answ = "?";   /* value illegible in the source listing; "?" is assumed */
        snd_reply(mp, *answ);
        break;

    case 'V':
        if(my_number < 0)
            break;
        if(f % NMEMBER != my_number)
            break;
        answ = 0;
        if(f == nfields)
            answ = "?";
        else for(n = 1; n < nlines; n++)
            if(strcmp(db[n][0], cnames[cat]))
                continue;
            else if(compare(comp, db[n][f], query) == 0)
                answ = (answ && *answ != 'Y')? "?": "Y";
            else
                answ = (answ && *answ != 'N')? "?": "N";
        if(answ == 0)
            answ = "?";   /* line missing from the source listing; "?" is assumed */
        snd_reply(mp, *answ);
        break;

    default:
        snd_reply(mp, '*');
        break;
    }
}

/*

 * This version sends two-part replies: the index of the respondent and
 * the answer from that respondent. Caller will get exactly one answer
 * from each respondent as long as the number of processes running is
 * at least NMEMBER. See discussion below for the case of too many
 * failures to tolerate.
 */
snd_reply(mp, rep)
register message *mp;
{
    char answ[2];

    answ[0] = my_number;
    answ[1] = rep;
    reply(mp, answ, FTYPE_CHARS, 2);
}

/* String comparison, implements numeric scalar comparisons too */
compare(op, s1, s2)
char *s1, *s2;
{
    register n1, n2;

    if(op != '<' && op != '>')
        return(strcmp(s1, s2));
    n1 = atoi(s1);
    n2 = atoi(s2);
    /* Like strcmp, return 0 when the comparison holds */
    if(op == '<')
        return(n1 >= n2);
    return(n1 <= n2);
}

We promised to explain how we could have handled the number of members dropping below
NMEMBER a bit more gracefully. Notice that all the members would detect this situation; it's
just that they panic in this example instead of doing anything. A better solution is for the group
to reply "unavailable" (one member would have to do "double duty" and cover for the missing
members) in the single-reply query mode, or the qa program would hang in that case.
Meanwhile, the coordinator is frantically bringing up new members, so with luck the situation
wouldn't persist for long. A qa program that gets an unavailable response would have to wait a
few seconds and retry.




STATE_XFER(TK)          DISTRIBUTED SYSTEMS TOOLKIT          STATE_XFER(TK)

1. Synopsis

A toolkit routine for transferring state from a process group to a process that is joining it.

2. Interface

#include <isis/cl.h>

/* Client side */

isis_init(0);

join_and_xfer(gid, mp, routine, size)
address gid;
message *mp;
int (*routine)();

/* Server side */

allow_xfer(start_routine, data_routine, restart_routine)
int (*start_routine)();
int (*data_routine)();
int (*restart_routine)();

3. Discussion

The state transfer tool is normally used by a process that wishes to join an existing process group
without preventing clients from using the group, but needs a copy of some state information to
begin functioning normally. The tool "hides" the join and state transfer event so that clients see
this as an instantaneous transition.

4. What's a state?

The tool assumes that processes can represent the state of their computation in some number of
blocks, which can have arbitrary and variable size. The programmer must somehow write code
that lets the tool "read" this state, one "block" at a time. For example, in the twenty questions
program, the state is basically the contents of the 'db' data structure; everything else can be
computed or obtained from things like the process group view. In a transactional application, on the
other hand, the state should include locking information and output of uncommitted transactions.
So, if you use transactions you either have a difficult state problem to overcome (since
that tool won't give you this information!) or must do the transfer when nothing that matters is
running, for example by acquiring read locks on the transactional files before starting the
transfer.

5. How to use the tool

The process wishing to join the group invokes join_and_xfer(), specifying the group to join, a
message for the join verifier (see pg_join() in PGROUP(TK)) and a state reception routine. The
size argument indicates if the transfer will be a big one (X_BIG) or a small one (X_SMALL). A
large transfer is done using a TCP stream channel for high performance and would normally
require that multiple data blocks be computed and copied. A small state transfer is assumed to fit
into a single message, which could be fairly large. In this case, the overhead of a connection
setup is avoided, but on the other hand, the data is transferred by ISIS and hence the throughput is
quite a bit lower than using TCP.

join_and_xfer operates much like pg_join(gid, mp). Assuming that the join succeeds, however,
the transfer tool runs a coordinator-cohort algorithm in which the action routine repeatedly
requests blocks of state from the generation routine and delivers them to the reception routine.
The client's reception routine is invoked as routine(bn, data, blen); where bn is the block number
being transferred, data is a pointer to the data, and blen is the length in bytes of this block. If a
failure occurs and the transfer restarts, the client will see the block numbers reset to 0 (a future
refinement will allow the transfer to resume with the next block in sequence, but this is not yet
implemented). When the transfer is completed, join_and_xfer() returns 0 if it succeeded and an
error code from pr_errors.h otherwise.

[Figure: State transfer is actually a 3-step algorithm. 1. New process triggers migration using
GBCAST. 2. TCP transfer used to copy state, if large. 3. Old member drops out with another
GBCAST. Annotations: messages spooled; spooled messages processed; spooled msgs. discarded.
Clients view state transfers and migration as an instantaneous event.]
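
Since a restarted transfer begins again at block 0, a reception routine has to be prepared to throw away a partial copy. The sketch below shows one way to do that. It is an illustration only: got_block, the fixed-size state array and the restart counter are hypothetical and not part of the toolkit; only the (bn, data, blen) calling convention comes from the text above.

```c
#include <assert.h>
#include <string.h>

#define MAXBLOCKS 64     /* hypothetical limits for this sketch */
#define BLOCKSIZE 128

char state[MAXBLOCKS][BLOCKSIZE];
int  nblocks;            /* blocks received in the current attempt */
int  restarts;           /* how many times the transfer started over */

/*
 * Reception routine in the style of routine(bn, data, blen).
 * Seeing block 0 again after some blocks have arrived means the
 * transfer restarted, so anything received so far is discarded.
 * Returns 0 on success, -1 if the block is out of sequence or too big.
 */
int got_block(int bn, const char *data, int blen)
{
    assert(data != 0);
    if (bn == 0 && nblocks > 0) {
        memset(state, 0, sizeof state);
        nblocks = 0;
        restarts++;
    }
    if (bn != nblocks || bn >= MAXBLOCKS || blen < 0 || blen > BLOCKSIZE)
        return -1;
    memcpy(state[bn], data, (size_t)blen);
    nblocks = bn + 1;
    return 0;
}
```

The gotline routine in the twenty questions program can skip this bookkeeping because each block simply overwrites one line of db, so a repeated block does no harm.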

On the side of the group being joined, life is a bit more complex. To allow transfers, the
members of this service must all invoke the allow_xfer() procedure, specifying the routines that
are to be called when a transfer is started and to obtain each block of state. These routines are
invoked as follows.

start_routine(mp, who, gid, alist)
message *mp;
address who, gid, alist[MAX_PROCS];

After the join verifier has validated the join request (see PGROUP(TK)), the start routine is called to
compute the set of group components that will participate in this transfer. The routine should fill
in this alist, for example by copying pg->pg_alist from the current view of process group gid. All
members will observe the same pgroup view when the start_routine is called. The argument mp is
a pointer to the message from the join_and_xfer request and the argument who specifies the
address of the process that is joining.

rval = data_routine(bno, data, type, len)
int bno, *type, *len;
char **data;

The data routine is responsible for computing block bno of a transfer and returns a pointer to a
region containing the data and its length through the data and len arguments. The type field
should be set using the types known to the MSG_EDIT system, and is used to byte-swap the data
being transferred if necessary. The return value should be -1 if the transfer has terminated (there
is no block corresponding to offset bno) and 0 if the block has been computed.

After a failure a transfer can resume either at the first block (bno = 0) or the next block after the
last successfully transferred one or any block number in between. The routine

start_at = restart_routine(bno)

will be invoked with the number of the last transferred block + 1, and should return the next
block number to use. For most applications this routine either returns bno if it is possible to just
continue with the next block, or 0, say if the block sizes or contents could depend on the sender
even though several processes can send the state. (The recipient would have to detect the
discontinuity and clean up from the interrupted transfer if the application requires some sort of cleanup
action before restarting at block 0.)
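
To make the interplay between the data routine and the restart routine concrete, here is a small self-contained sketch. The data_routine and restart_routine follow the calling conventions described above, but everything else is an assumption made for the example: the fixed table of blocks, the value stored in *type, and run_transfer, which is a toy stand-in for the transfer tool that simulates a single sender failure.

```c
#include <assert.h>
#include <string.h>

#define NBLOCKS 4            /* hypothetical: state is four fixed blocks */
#define BLOCKLEN 8

char blocks[NBLOCKS][BLOCKLEN] = { "alpha", "beta", "gamma", "delta" };

/* Data routine: hand back block bno, or -1 once past the last block. */
int data_routine(int bno, char **data, int *type, int *len)
{
    if (bno < 0 || bno >= NBLOCKS)
        return -1;
    *data = blocks[bno];
    *type = 0;               /* placeholder; real code would name a message field type */
    *len = BLOCKLEN;
    return 0;
}

/* Restart routine: every sender holds identical blocks, so just continue. */
int restart_routine(int bno)
{
    return bno;
}

/*
 * Toy driver standing in for the transfer tool: copies blocks into out,
 * simulating one sender failure just before block fail_at is sent.
 * Returns the number of the block after the last one delivered.
 */
int run_transfer(char out[NBLOCKS][BLOCKLEN], int fail_at)
{
    int bno = 0, failed = 0, type, len;
    char *data;

    assert(out != 0);
    for (;;) {
        if (!failed && bno == fail_at) {
            failed = 1;
            bno = restart_routine(bno);   /* called with last block + 1 */
            continue;
        }
        if (data_routine(bno, &data, &type, &len) != 0)
            break;
        memcpy(out[bno], data, (size_t)len);
        bno++;
    }
    return bno;
}
```

A restart routine that returned 0 instead would make the driver resend every block from the beginning, which is the safe choice when different senders might cut the state into blocks differently.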


6.  Interactions  with  other  tools 


This version of the state transfer tool doesn't automatically transfer the state of the toolkit
routines, such as the semaphore state. The semaphore tool provides a way to generate a "block" of
state information and read it in remotely. The state transfer tool itself will be integrated with the
transaction tool in such a manner as to ensure that the transfer occurs only when there are no
write locks active in the participating processes. We plan to provide some sort of state transfer
option for the transaction file and core-data structure management tools too, but these are still
undergoing design.
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1.  Synopsis 

A  package  of  routines  for  obtaining  and  monitoring  the  site  status  data  structure. 

2. Interface

#include <isis/cl.h>

isis_init(0);

sview *sv_getview()

sv_monitor(routine, arg)
int (*routine)();
char *arg;

sv_monitor_cancel(routine, arg)
int (*routine)();
char *arg;

3.  Discussion 

This package permits a program to access and monitor the site-view data structure maintained by
the ISIS failure detector. The fields of a site-view are: sv_slist[] (the site-id's of the operational
sites, null-terminated), sv_incarn[i] which gives the current incarnation number for site i, sv_failed
which lists processes that failed when the view last changed, and sv_recovered, which lists those
that recovered. The latter two are both bitvecs. The data structure itself is defined in
pr_sviews.h.

The routines are as follows:

a) sv_getview() obtains the most current site view.

b) sv_monitor() causes the specified routine to be invoked as

routine(sv, arg)
sview *sv;
char *arg;

each time the site-view changes.

c) sv_monitor_cancel() cancels an sv_monitor() request. The arguments must match those for
the sv_monitor(). It fails if the monitor request is unknown.

Some things that you can assume about site-views include: the sv_slist[] entries are ordered
according to decreasing age: the first sv_slist[] entry is the site that has been up longest and the last is
the site that came up most recently. The bitvecs (see BITVEC(TK)) sv_failed and sv_recovered
indicate which sites have undergone a failure/recovery (never both) since the last view was
committed. Several sites could change status in a single change of site-view. Finally, the sv_incarn[]
vector gives a quick way to check the incarnation number of a particular site. The
sv_incarn[] entry for a site will contain an "illegal" value if the site is down.
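
The two properties of sv_slist[] that programs rely on most, null termination and oldest-first ordering, can be exercised with a few lines of code. The sketch below works on a plain array standing in for sv_slist[]; the site_id typedef and the three helper names are hypothetical and chosen only for this example.

```c
#include <assert.h>

typedef unsigned short site_id;   /* hypothetical stand-in for the real type */

/* Count the entries of a null-terminated site list such as sv_slist[]. */
int count_sites(const site_id *slist)
{
    int n = 0;

    assert(slist != 0);
    while (slist[n] != 0)
        n++;
    return n;
}

/* The list is ordered oldest first, so the longest-running site is entry 0. */
site_id oldest_site(const site_id *slist)
{
    assert(slist != 0);
    return slist[0];              /* 0 if no site is operational */
}

/*
 * Pick sites in rotation so that successive starts are spread out.
 * *next is a caller-owned cursor; returns 0 if the list is empty.
 */
site_id next_site(const site_id *slist, int *next)
{
    int n = count_sites(slist);

    assert(next != 0);
    if (n == 0)
        return 0;
    if (*next >= n)
        *next = 0;
    return slist[(*next)++];
}
```

Because the cursor is re-checked against the current length on every call, the rotation keeps working when the list shrinks between calls, as it would after a site failure.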




TASKS(TK)          DISTRIBUTED SYSTEMS TOOLKIT          TASKS(TK)

1. Synopsis

An overview of the light-weight task facility provided by ISIS.

2. Interface

#include <isis/cl.h>

t_fork_urgent(routine, arg, mp);
t_fork_delayed(routine, arg, mp);
int (*routine)();
char *arg;
message *mp;

value = t_wait(cond);
int value;
condition *cond;

t_sig_urgent(cond, value);
t_sig_delayed(cond, value);
condition *cond;
int value;

run_tasks();


3.  Discussion 

Although normal UNIX systems provide only a single thread of control per process, it has been
convenient in ISIS to pretend that each process consists of a set of light-weight tasks that share a
single address space. The routines in the task utility provide the ISIS client with access to the
light-weight tasking mechanism we implemented for this purpose. The mechanism is a fake in
several respects: even though a process can have many tasks, the entire process blocks if a
blocking system call is performed. Moreover, there is no true concurrency in the scheme, although it
can sometimes look as if there is. In particular, the programmer must be very wary about
possible race conditions whenever a task might suspend.

A task is a thread of control having its own stack and registers, but sharing global variables
(including static ones) with other tasks. The stack of a task is limited in size, currently to 8k
bytes, and if this limit is exceeded, the error will not be detected. However, 8k bytes is really
quite a substantial amount of space unless recursion is attempted. The system task is the one that
was running when the process started up, and it has an unlimited stack, so you may want to take
advantage of this if an algorithm definitely needs more than 8k stack space. Certainly,
programmers who work with tasks will need to avoid allocating large amounts of data on the stack or
recursive algorithms.

Like coroutines, tasks can employ signal and condition variables to block while waiting for one
another. The basic interface is as follows:

t_fork_urgent(routine,arg,mp)

The caller is suspended in a runnable state and a new task is created, invoking the designated
routine as:

routine(arg)
char *arg;

The new task continues to execute until the routine returns, at which point the task
terminates and resources it used (stack, register save area) are freed for use by other tasks.
The message pointer should normally be 0. If it is non-zero, then msg_increfcount(mp) will
be called immediately and msg_delete(mp) when the task exits. This is used in the system
itself when invoking a routine after message reception occurs.

t_fork_delayed(routine,arg,mp)

The new task is placed on the runqueue and the caller continues to execute.

value = t_wait(cond)

The argument is a variable of type condition. It should be initialized to zero explicitly or
allocated in a global or static memory location. The caller suspends in a waiting state until a
signal is applied to the condition variable.

t_sig_urgent(cond, value)

The first task waiting on cond is awakened and passed the designated value. The caller is
suspended in a runnable state on the runqueue.

t_sig_delayed(cond, value)

The first task waiting on cond is marked as runnable and placed on the runqueue. When it
gets to run, it will receive the indicated value.

run_tasks()

This routine must be called from the system task to run tasks for a while. It returns when
there are no more runnable tasks available. For example, the isis "main_loop" loops, first
calling isis_read() and then run_tasks().

t_scheck()

This routine checks for stack overflow and calls the panic() routine if one is detected. It is
invoked automatically when switching from task to task, but will not detect overflows that
happened previously if the stack pointer has returned to a safe area when the switching takes
place.

4. Costs

The costs associated with the task mechanism are minimal. A "context switch" can be done in about
[...] on a SUN workstation, although task creation [...] if a
stack area needs to be allocated (re-use of an old one, on the contrary, is nearly free).

5. System task

The task that runs "run_tasks" is special: it is called the system task, and were it to try and call
t_wait, problems would arise when the light-weight task facility next wants to schedule
some other task. Consequently, the system task must never try to block. It is permitted to
execute t_fork_urgent, t_fork_delayed, t_sig_urgent, t_sig_delayed but not t_wait.

6.  Caution 

It is crucial that the user of this package keep in mind that although a task must suspend
itself explicitly using t_wait or t_fork_urgent or t_sig_urgent to be blocked, this can happen
as a result of calling routines someone else has coded. The reason this is a major issue is
that once a task blocks, in principle any other task could wake up, and this means that global
variables could change and procedures could be reentered, although using a different stack and
a different set of local variables.
The stack overflow check should be done automatically on every procedure call, say by a
modified  version  of  the  mcount  procedure  that  gets  linked  in  when  a  program  is  compiled  with 
1.  Synopsis

Nested transactions in ISIS. This mechanism is based on one from ISIS-1, but doesn’t require
that you program using “resilient objects”. The code hasn’t all been ported yet, but it should
be usable sometime in August.

2.  Interface

#include <isis/cl.h>

- Transaction  control - 

/* Start a new (sub)transaction */
t_begin(abort_on_failure)
int abort_on_failure;

/* Commit a (sub)transaction */
t_commit();

/* Abort a (sub)transaction */
t_abort();

- Accessing  files  (stable  storage)  - 

t_sopen(file_name, how, fmode)
char *file_name;

t_ssize(file_name)
char *file_name;

t_sread(file_name, offset, buffer, len)
char *file_name, *buffer;

t_swrite(file_name, offset, buffer, len)
char *file_name, *buffer;

t_sfsync(file_name)
char *file_name;

t_sclose(file_name)
char *file_name;

- Accessing in-core storage transactionally -

t_copen(item_name, how, fmode)
char *item_name;

t_csize(item_name)
char *item_name;

t_cread(item_name, offset, buffer, len)
char *item_name, *buffer;

t_cwrite(item_name, offset, buffer, len)
char *item_name, *buffer;

t_cfsync(item_name)
char *item_name;

t_cclose(item_name)
char *item_name;

- Concurrency control -

t_rlock(alist, item_name, offset)
char *item_name;

t_wlock(alist, item_name, offset)
char *item_name;

- Internal, to monitor for commit and abort events -

t_monitor(routine)
int (*routine)();

t_outcome(tid) 
tram  *tid; 


3.  Discussion

t_begin, t_commit, t_abort. Although ISIS normally does not run in a transactional mode, the
whole system is compatible with transactions in a way that makes it easy to obtain them, if
desired. To turn on “transactional execution”, a routine simply calls t_begin() and later, when
it terminates, either t_commit() or t_abort(). If a routine is called by another transaction,
the result is a nested transaction. The caller that invokes t_begin() should also indicate
whether this (sub)transaction should automatically be aborted in the event that the process
that did the invocation should crash, or its site should fail. The only case when
abort_on_failure should ever be false (0) is when the transaction is being done in a
coordinator-cohort computation and some cohort will take over and run the request forward to
completion, doing exactly what the failed coordinator did (see the various ISIS papers on
roll-forward transactional execution for details). Normally, you will thus request
abort_on_failure by setting this flag to true (1). If abort_on_failure is false but the
transaction is not restarted in this manner, your application is quite likely to hang.
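
The begin/commit/abort calling pattern and the effect of nesting can be illustrated with a toy,
in-memory model. This is only a sketch under stated assumptions: toy_begin, toy_commit,
toy_abort, toy_write, and toy_read are hypothetical names, the model guards a single integer,
it ignores failures and the abort_on_failure flag entirely, and it assumes the usual
nested-transaction rule that a committed subtransaction is still undone if its parent aborts.

```c
#include <assert.h>

#define TOY_MAX_DEPTH 8

static int toy_value = 0;             /* the single "transactional" datum */
static int toy_saved[TOY_MAX_DEPTH];  /* value saved at each toy_begin() */
static int toy_depth = 0;             /* current nesting depth */

/* Start a (sub)transaction by remembering the current value. */
int toy_begin(void)
{
    if (toy_depth == TOY_MAX_DEPTH)
        return -1;
    toy_saved[toy_depth++] = toy_value;
    return 0;
}

/* Commit: keep the current value and drop this level's saved copy.
   An enclosing transaction can still undo the change. */
int toy_commit(void)
{
    if (toy_depth == 0)
        return -1;
    toy_depth--;
    return 0;
}

/* Abort: restore the value saved when this (sub)transaction began. */
int toy_abort(void)
{
    if (toy_depth == 0)
        return -1;
    toy_value = toy_saved[--toy_depth];
    return 0;
}

void toy_write(int v)
{
    toy_value = v;
}

int toy_read(void)
{
    return toy_value;
}
```

In this model a routine that calls toy_begin() while another transaction is open becomes a
subtransaction of it, mirroring the nesting rule described above.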

t_open, ... t_close. These routines access a file transactionally, using t_monitor to detect
the termination of each transaction or subtransaction automatically. The t_sxxx versions work
with disk files and the t_cxxx versions with in-core data structures. Checkpoints are needed to
recover from total failures in the latter case; this is automatic when using stable files. They
can be called “as is” (as are?), or can be called from the replication package to implement
replicated files (the file name should refer to a different copy of the file in each replica
manager, of course). If used in this manner, the default broadcast primitive should be CBCAST.

t_rlock, t_wlock. These routines support transaction read (non-exclusive) and write (exclusive)
locking, following the standard 2-phase locking protocol. The alist argument indicates where
the lock in question lives. Both give what seems to be “all copies” locking, but the rlock
algorithm actually is asynchronous, whereas the wlock algorithm is a slower synchronous one.
So, read locks are much cheaper than write locks in the case of replicated data. Locks of
either kind will be reissued silently if requested more than once. Note: read locks are never
“broken” by failures in ISIS. Note: when using wlock on replicated data, take care that all
callers of wlock do their wlock calls in the same order, or deadlock can result. For example,
you could use ABCAST to implement a group RPC and then have all members call wlock in parallel
on their private data, or you could use CBCAST to implement the RPC and then employ a
coordinator-cohort algorithm, this time having only the coordinator call wlock and specifying
the group’s id in the alist argument.

Semaphores  can  also  be  used  to  control  access  to  files  and  data,  but  they  ignore  the  transactional 
scope  rules  and  hence  could  get  you  into  trouble. 

4.  Descriptions  of  internal  routines 

Transaction ids. When running as a transaction, ctp->task_tid is a pointer to a descriptive
structure characterizing the state of the current transaction. This can be printed by calling
t_print(ctp->task_tid).

t_monitor. This routine is used internally by ISIS to watch for the commit or abort of a
transaction that has taken actions in the invoking process. The routine is invoked as:

routine(how)
{
}

where the argument how will be one of T_COMMIT1, T_COMMIT2, and T_ABORT. The commit phases
are the usual ones for a two-phase commit. If you plan to implement your own transactional
storage, then during phase 1, records written by the transaction should be forced to stable
storage. If stable storage is not an issue, take no action during phase 1 commit. During
phase 2, commit records can be written. In a T_ABORT, the effects of the transaction must be
rolled back. This is all automatic in the case of the stable storage routines provided by the
system.
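
As a rough illustration of what a user-supplied monitor routine might do in each phase, the
sketch below handles the three values of how against a toy one-record store. The numeric
values given to T_COMMIT1, T_COMMIT2, and T_ABORT here are assumptions (the real ones come
from the ISIS headers), and toy_record, toy_rec_write, and toy_monitor_routine are hypothetical
names; “forcing to stable storage” is only simulated by setting a flag.

```c
#include <assert.h>

/* Hypothetical values: the real constants come from the ISIS headers. */
enum { T_COMMIT1 = 1, T_COMMIT2 = 2, T_ABORT = 3 };

/* Toy single-record store standing in for user-implemented storage. */
struct toy_record {
    int committed;    /* value visible once the transaction commits */
    int pending;      /* value written by the running transaction */
    int has_pending;  /* nonzero if the transaction wrote something */
    int forced;       /* nonzero once pending data was "forced" in phase 1 */
};

static struct toy_record toy_rec = { 0, 0, 0, 0 };

/* A transactional write: buffered until the outcome is known. */
void toy_rec_write(int value)
{
    toy_rec.pending = value;
    toy_rec.has_pending = 1;
    toy_rec.forced = 0;
}

/* Callback in the style of routine(how). Returns 0, or -1 for an unknown code. */
int toy_monitor_routine(int how)
{
    switch (how) {
    case T_COMMIT1:
        /* Phase 1: force records written by the transaction (simulated). */
        toy_rec.forced = toy_rec.has_pending;
        break;
    case T_COMMIT2:
        /* Phase 2: make the pending value the committed one. */
        if (toy_rec.has_pending) {
            toy_rec.committed = toy_rec.pending;
            toy_rec.has_pending = 0;
        }
        toy_rec.forced = 0;
        break;
    case T_ABORT:
        /* Roll back: discard whatever the transaction wrote. */
        toy_rec.has_pending = 0;
        toy_rec.forced = 0;
        break;
    default:
        return -1;
    }
    return 0;
}

int toy_rec_committed(void)
{
    return toy_rec.committed;
}

int toy_rec_forced(void)
{
    return toy_rec.forced;
}
```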

t_outcome. This routine is used when a site recovers and the stable storage utility discovers
that it crashed during the second phase of the commit protocol for some transaction. If all the
sites that know what this outcome was are down, it could take quite a while for this routine to
complete, and while it is running access to the file the transaction updated is not allowed
(both t_rlock and t_wlock will block). On the other hand, the updated file can be accessed
using t_sread and t_swrite without acquiring locks if an emergency need to see the contents
arises. One would obtain the contents of the file as if the transaction did commit in this
case. Since the odds are overwhelming that this is exactly what happened, the behavior that
results is probably fine.

5.  Recovery 

There  are  several  cases: 

1.  In the case where you arranged for the stable storage routines to be called from the
replicated data manager, a facility is provided that will automatically bring your copy of the
file back into “sync” with any others after recovery. Use the rmgr to determine which of the
following cases applies. If your program is the first to recover from a crash of all members
of the group, it can use its local copy of the file - it will already have been restored to a
consistent state by the transaction facility. If your program is supposed to rejoin, we
currently recommend that the entire file be copied from some process with a copy. A better
mechanism will be added someday, but meanwhile this will have to do.

2.  When  using  the  in-core  storage  routines,  recovery  depends  on  how  the  failure  occurred.  If 
the  failure  causes  all  processes  to  crash,  you  must  have  a  checkpoint  around  in  order  to 
recover.  Assuming  that  you  do,  the  recovery  sequence  is  as  above,  but  using  the  checkpoint. 
You  can  make  a  checkpoint  anytime  the  data  is  idle,  but  not  during  a  transaction  that  has 
updated  it. 
3.  In the case where in-core transactional storage is being used and a partial failure takes
place, the state transfer tool should be used to copy the data from some operational process
possessing a replica.

6.  Examples 

Transactions  are  easiest  to  use  if  you  follow  very  stylized  coding  conventions.  Some  examples  to 
illustrate  the  most  common  cases  follow. 

6.1.  Non-replicated data

6.2.  Replicated data, resilient object style

6.3.  Replicated data, CIRCUS style

6.4.  Replicated data, using quorums

7.  State transfers

This will look something like the semaphore state_xfer_out/in mechanism.
1.  Synopsis

What's  this  virtual  synchrony  business  all  about  anyhow,  and  what  do  I  need  to  do  about  it? 

All  of  ISIS  is  built  using  a  collection  of  broadcast  communication  primitives  that,  if  used  correctly, 
provide  “virtually  synchronous"  distributed  executions.  This  idea  is  one  we  use  throughout  ISIS, 
and  it  can  greatly  simplify  the  design  of  even  very  high  level  software. 

Figure 1 shows a conventional distributed execution. Such an execution is characterized by
message passing and the discovery of occasional “unexpected” events, such as crashes, timeouts
due to system overloads and transient communication failures, reception of new “request”
messages while pending requests are still underway, and differences in the perceived system
state, from process to process, even when a single event is being observed from multiple
perspectives. An environment like this one is difficult to work with - we call it a completely
asynchronous one - and it provides very little support for the programmer who must implement a
distributed application program.

In Figure 2, a virtually synchronous execution is shown. Such an execution has the property
that it looks to an observer (to a program using ISIS, in particular) as if one event takes
place at a time in the system. That is, if a process fails, it looks as if no communication
events were active at the time, and everyone monitoring for the failure sees it occur
“simultaneously”. When a communication action occurs (see BCAST(TK)), failures never seem to
take place until the messages have been delivered, and it looks as if no other communication
was taking place at the same time. In fact, each message indicates the processes to which it
was delivered, even though the message may have been addressed using process groups as
destinations and process groups have dynamically varying membership. This has several real
advantages from the perspective of the programmer who works with ISIS.

Figure 1: Conventional message passing picture of a system

One relates to algorithm design. In ISIS, it is possible to deduce the action that other
processes will take by just looking at a message, its destinations, and the “state” of the
system at the time the message arrives. For example, this might include the membership of a
process group (pg_alist in a process group view), the sites that are operational (sv_slist in
a site-view), the contents of the message, or even user-supplied data structures managed using
the configuration tool (see CONFIG(TK)). In fact, the addresses in a process-group or site view
are even ordered according to increasing age, and you can use this in your code. Moreover, all
members of a process group receive a broadcast message if any does so - there is no need to
manually make sure that everyone has their copy. Finally, actions initiated by a dead process
are terminated before the death is announced... for example, if a process might have been
doing a broadcast or adding a member to a process group or taking some other action when it
died, either the action completes before the process failure is observed, or the action never
happens at all - the failure preceded it. Jointly, these aspects really simplify life: they
eliminate the chit-chat normally needed when a group of processes receive a message, and let
everyone act in a coordinated fashion without taking any actions to achieve the coordination.
Of course, it may be necessary for group members to monitor one another, but prepackaged tools
like the coordinator-cohort algorithm (COORD(TK)) and the monitoring routines (pg_monitor in
PGROUPS(TK) and the watch() routine in WATCH(TK)) make this as easy as possible. Or, one can
arrange for the client of a group to take part in getting an action done by simply having all
members respond to “their part” of a request, and having the client collate responses, decide
if the action really got done, and re-issue the request if necessary.

One implication of virtual synchrony is that most ISIS mechanisms are orthogonal to each other.
Hence, you can combine semaphores, state transfers, and replicated data without one mechanism
impacting much on the others. Of course, this isn’t magical: state transfers while the
semaphore is in use or while transactions are running can be a bit awkward. But mechanisms to
avoid these problems are provided in most cases, and are being added in others.

Figure 2: Virtual synchrony picture with a process group and a few clients

In addition, several of the data structures that ISIS maintains have properties based on
virtual synchrony. As noted above, everyone sees the same sequence of pgroup views and site
views (see pg_monitor() and sv_monitor()) and within these views, the sequence of addresses in
a pg_alist or site-ids in an sv_slist are ordered according to decreasing age. Moreover, the
current pgroup view at the time a message is received is the same for all recipients. Thus, one
can receive a message, check the current view, and then make a decision in such a way that only
one process is responsible for executing the message and all others are backups - this is how
the coordinator-cohort algorithm is built in ISIS. Once you get used to taking advantage of
this approach, you will see how it simplifies your code: rather than design a protocol to
discuss who should handle a request, you design a simple local decision that everyone who
receives the request can execute in parallel, in such a way that all reach the same decision
without message exchange.
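
A “simple local decision” of this kind can be sketched as a pure function of the view and the
request. The code below is illustrative only and is not the ISIS coordinator-cohort
implementation: struct toy_view, toy_coordinator, and toy_i_am_coordinator are hypothetical,
member addresses are reduced to plain integers, and the selection rule (request id modulo view
size) is just one assumed way to pick the same member everywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified view: member ids in the order every recipient sees. */
struct toy_view {
    const int *members;
    size_t n;
};

/* Every recipient evaluates this on the same view and request id, so all of
   them name the same coordinator without exchanging messages.
   Returns -1 if the view is empty. */
int toy_coordinator(const struct toy_view *v, unsigned request_id)
{
    if (v->n == 0)
        return -1;
    return v->members[request_id % v->n];
}

/* Nonzero if the calling member should execute the request itself;
   all other members act as backups. */
int toy_i_am_coordinator(const struct toy_view *v, unsigned request_id, int my_id)
{
    return toy_coordinator(v, request_id) == my_id;
}
```

The scheme only works because each recipient is guaranteed to see the same view when the
message arrives; with differing views, two members could both decide they are responsible.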

Another advantage to the ISIS approach is that there are genuinely fewer things to worry about
when designing code. Basically, you can use a finite state approach. In a given state, your
program may:

1.  Be waiting for a response to some request (or several, if concurrent tasks are running).

2.  Detect  a  failure. 

3.  Receive  a  new  request. 

But, since all copies of the program see these events in the “same order”, and everyone sees
every event that concerns them, there is no uncertainty in your code: for each possible event,
you simply specify the appropriate actions and you are finished. Unless you omit the case of,
say, a failure occurring while your program is waiting for a response from some process, and
this causes a crash to occur, code that covers the above cases will cover everything necessary
for correct performance in ISIS. In contrast, imagine the uncertainty of executing in a
conventional environment! Failures may be detected incorrectly, messages may fail to reach some
destination, or arrive out of order, and events may be perceived in different orders by
different processes in the system. The simplest actions are fraught with danger. Many programs
that can easily be written using ISIS are, for these reasons, nearly impossible to write in any
other way!

What does virtual synchrony cost? Well, the cost could be high if you use the most costly
broadcast primitive (GBCAST) too casually, and this is a major reason for using the toolkit as
much as possible. The tools use the cheapest broadcast they can, and performance will be good
when you stick to them - at least for things that ISIS is good at. These are things like
maintaining replicated or recoverable data structures, synchronizing actions, and sending
requests to services. On the other hand, bulk data transfers are best accomplished using the
state transfer mechanism or other non-ISIS mechanisms. With careful attention to design,
performance of an ISIS-based application can be as good as or better than for a non-ISIS
application.

What is the minimum you need to understand? Basically, the ISIS user has two kinds of decisions
to make. One is to decide how to structure the application into process groups and processes
and what data structures will be needed. Often, the ISIS toolkit routines can be used to
implement this structure, following our tutorial examples, but in many cases you will need to
perform “group RPC” requests. This leads to the second decision: when sending a message, you
will need to decide whether an answer is needed, and how many answers are needed (0, 1, n, or
ALL). You will also need to determine whether the group is actually sensitive to the order in
which it receives this type of request - if so, you should use ABCAST to send the requests, if
not CBCAST is preferred. For example, a service that maintains a replicated queue of some sort
would probably be using ABCAST: queue order will be the same everywhere if requests arrive
everywhere in a fixed order, and that’s exactly what ABCAST provides. On the other hand, a
service that maintains a database and answers questions out of it could normally be accessed
using CBCAST: performance will be better, and in this case the order in which queries arrive
doesn’t change the answer that should be given. GBCAST is normally used only in the toolkit
routines.
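
Why a replicated queue is sensitive to delivery order can be seen in a small simulation. The
sketch below does not call any ISIS broadcast primitive: toy_queue, toy_deliver, and
toy_same_queue are hypothetical names, and the delivery order that a broadcast would produce is
passed in as an array of indices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_QUEUE_MAX 8

/* One replica's copy of the queue. */
struct toy_queue {
    int items[TOY_QUEUE_MAX];
    size_t len;
};

/* Rebuild a replica's queue by appending requests in the order in which
   they were delivered to that replica (order[i] indexes into requests). */
void toy_deliver(struct toy_queue *q, const int *requests,
                 const size_t *order, size_t n)
{
    size_t i;

    q->len = 0;
    for (i = 0; i < n && i < TOY_QUEUE_MAX; i++)
        q->items[q->len++] = requests[order[i]];
}

/* Nonzero if two replicas hold identical queues. */
int toy_same_queue(const struct toy_queue *a, const struct toy_queue *b)
{
    return a->len == b->len &&
           memcmp(a->items, b->items, a->len * sizeof a->items[0]) == 0;
}
```

Replicas that are handed the same order end up identical, which is the situation a totally
ordered broadcast is meant to guarantee; replicas handed different orders diverge.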

Who should answer a query? The easiest solution might be for everyone to reply (some replies
might be of the form “I don’t know”, indicated by calling reply(mp, 0, 0, 0)). Also, keep in
mind that one can reply with fewer than “slen” bytes of reply information (see BCAST(TK)). For
example, if the reply is a byte string, the first byte could be a code indicating whether the
rest of the reply contains valid data. If you prefer to receive a single reply the
coordinator-cohort tool should be used. This has some overhead and the amount of work done by
the coordinator should be non-trivial to justify paying this added cost. One situation in which
the tool is not recommended arises when the reply will be very large. It might seem like you
should use the tool to avoid wasting “space” on replies from processes other than the
coordinator. In fact, however, this would be just the situation in which the overhead of the
coordinator-cohort algorithm turns out to be largest! The overhead is almost zero, on the other
hand, if the coordinator sends the data using a CBCAST to the caller and then replies with a
status code, say an integer.
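
The “first byte is a code” convention mentioned above might look like the following. This is a
sketch of one possible encoding, not an ISIS routine: toy_pack_reply, toy_unpack_reply, and
the TOY_REPLY_* codes are hypothetical, and the buffers are ordinary byte arrays independent of
the ISIS message routines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes carried in the first byte of a reply. */
enum { TOY_REPLY_UNKNOWN = 0, TOY_REPLY_VALID = 1 };

/* Build a reply in buf. With data == NULL (or data too large for buf) the
   reply is the single byte "I don't know". Returns the reply length in
   bytes, or 0 if buf has no room at all. */
size_t toy_pack_reply(char *buf, size_t buflen, const char *data, size_t datalen)
{
    if (buflen == 0)
        return 0;
    if (data == NULL || datalen + 1 > buflen) {
        buf[0] = TOY_REPLY_UNKNOWN;
        return 1;
    }
    buf[0] = TOY_REPLY_VALID;
    memcpy(buf + 1, data, datalen);
    return datalen + 1;
}

/* Inspect a received reply. Returns a pointer to the payload and stores its
   length in *datalen when the first byte marks the data as valid; returns
   NULL otherwise. */
const char *toy_unpack_reply(const char *buf, size_t replylen, size_t *datalen)
{
    if (replylen == 0 || buf[0] != TOY_REPLY_VALID)
        return NULL;
    *datalen = replylen - 1;
    return buf + 1;
}
```

A caller collecting replies from all members would skip those for which toy_unpack_reply()
returns NULL and use the first valid payload.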

To summarize: virtual synchrony makes the toolkit possible and makes algorithmic design
surprisingly easy in ISIS. The benefits are substantial, but the programmer may be expected to
make some decisions that could affect performance, and to do this intelligently requires some
understanding of broadcast orderings. We strongly recommend that you read the ISIS papers if
this applies to you: documentation has a role, as do tutorials, but the papers are much more
systematic in attacking this sometimes subtle material.
1.  Synopsis

A  package  of  routines  implementing  a  watch  facility. 

2.  Interface 

#include <isis/cl.h>


isis_init(0); 


/* Watch a process */
wid = watch(addr, gid, routine, arg)
address addr, gid;
int (*routine)();
char *arg;

/* Wait for a process to join a process group of which caller is a member */
wid = watch_for(addr, gid, routine, arg)
address addr, gid;
int (*routine)();
char *arg;

/* Watch a site */
wid = site_watch(sid, routine, arg)
site_id sid;
int (*routine)();
char *arg;

/* Cancel a watch request of either sort */
watch_cancel(wid)
int wid;

watch_dump()

3.  Discussion

The watch facility is a tool for triggering actions in the event that the status of a process
or site should change. In the case of a process, watch() is used to watch for failure, whereas
watch_for() is used to watch for recovery. The argument “addr” should give the address of the
process to watch (for) and gid should be a group to which both the watched process and the
caller belong. The gid can also be specified as NULLADDRESS, in which case the routine uses an
algorithm that allows one to monitor any process at any site in the cluster. If possible,
specify a gid: the watch is much cheaper in this case. In the case of a site_watch, the
behavior of the watch facility depends on the site incarnation number given in the sid. If this
number is non-zero, the watch will notify the caller if the designated site fails. If the
incarnation is given as zero, the watch will notify the caller if the site makes any sort of a
transition: from up to down or from down to up. In all cases, a non-zero unique identifier is
returned and can be used to cancel the watch later. Watch returns 0 if the watch is impossible
because the event has already taken place. Watch returns -1 if the caller is not a member of
the designated group.
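
The dependence of site_watch on the incarnation number can be restated as a small predicate.
This is only a reading of the rule above expressed in code, not part of the watch package:
toy_site_watch_fires is a hypothetical name, and site status is reduced to two up/down flags.

```c
#include <assert.h>

/* Decide whether a site watch would notify its caller.
   incarnation: the incarnation number given in the sid (0 means "any").
   was_up, is_up: site status before and after the view change (0 or 1).
   Returns nonzero if the callback should be triggered. */
int toy_site_watch_fires(int incarnation, int was_up, int is_up)
{
    if (was_up == is_up)
        return 0;                 /* no transition, nothing to report */
    if (incarnation != 0)
        return was_up && !is_up;  /* specific incarnation: failure only */
    return 1;                     /* incarnation 0: any up/down transition */
}
```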

The  callback  routines  are  invoked  as  follows.  In  the  process  watch  case: 
routine(addr,  arg); 
Note that when doing a watch_for(), a parallel watch() might need to be done to detect and
handle the case where the watched-for process fails instead of joining the group. Depending on
which event occurs, the other event would be canceled when the routine is called.

In the site watch case the call sequence is as follows:
routine(sid, arg);

A callback can only occur when a message received by the watch subsystem triggers a site-view
or process-group view change; if the event was being watched for, the callback takes place
immediately. This implies that the isis_read() routine was active, hence the caller can assume
that callback will not happen “asynchronously” during other computation.

4.  Bugs 

Watch with NULLADDRESS specified as the gid has not yet been implemented. It will be
implemented when the transactional facility is added to ISIS later this summer.


